SEQUENCES CHARACTERISTIC OF HUMAN GENE TRANSCRIPTION PRODUCT

Technical Field 

The present invention relates to newly identified 
polynucleotide sequences corresponding to transcription 
products of human genes, and to complete gene sequences 
associated therewith.



This invention relates to human genes. Identification
and sequencing of human genes is a major goal of modern
scientific research. The sequence of human genes is more
than just a scientific curiosity. For example, by
identifying genes and determining their sequences, scientists
have been able to make large quantities of valuable human
"gene products." These include human insulin, interferon,
Factor VIII, tumor necrosis factor, human growth hormone,
tissue plasminogen activator, and numerous other compounds.
Additionally, knowledge of gene sequences can provide the key
to treatment or cure of genetic diseases (such as muscular
dystrophy and cystic fibrosis). The present invention
represents a quantum leap forward in mankind's knowledge of
human gene sequences.

There are several basic concepts of molecular biology
which figure prominently in the invention. A brief
explanation of those concepts follows. Additional background
information and definitions for scientific terms can be found
in the literature. See, for example, "Glossary of Genetics,
Classical and Molecular" by R. Rieger, A. Michaelis, and M.M.
Green (Fifth Edition, Springer-Verlag, New York (1991)). The
contents of this and other publications cited in the
specification are incorporated by reference herein.

At an initial level, the present invention is based on
identification and characterization of gene segments. Genes
are the basic units of inheritance. Each gene is a string of
connected bases called nucleotides. Most genes are formed of
deoxyribonucleic acid, DNA. (Some viruses contain genes of
ribonucleic acid, RNA.) The genetic information resides in
the particular sequence in which the bases are arranged. A
short sequence of nucleotides is often called a
polynucleotide or an oligonucleotide.

Like genes, polypeptides are built from long strings of
individual units. These units are amino acids. The
nucleotide sequence of a gene tells the cell the sequence in
which to arrange the amino acids to make the polypeptide
encoded by that gene. In general, chains of up to about 200
amino acids are called polypeptides, while proteins are
larger molecules made up of polypeptide subunits; both types
of molecules are referred to generally herein as
polypeptides. A triplet of nucleotides (codon) in DNA codes
for each amino acid or signals the beginning or end of the
message (anticodon). The term codon is also used for the
corresponding (and complementary) sequences of three
nucleotides in the mRNA into which the original DNA sequence
is transcribed.

Generally, enzymes in the cell transcribe the permanent
DNA of the gene into a temporary RNA copy, called messenger
RNA or mRNA. The mRNA, in turn, can be translated into a
polypeptide by the cell. This entire process is called gene
expression, and the polypeptide is the gene product encoded
by the gene.

Scientists have previously discovered how to reverse the
transcription process and copy mRNA back into DNA using an
enzyme called reverse transcriptase. The resulting DNA is
called complementary DNA, or cDNA. This is schematically
shown in the single Figure. When substantially all of the
mRNA from one cell or tissue is converted to cDNA at once and
cloned into multiple copies of a recombinant vector to allow
replication and manipulation in the laboratory, the result is
called a cDNA library.
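
As a rough illustration of the relationship between an mRNA and its cDNA, the following sketch derives the first-strand cDNA that is complementary to a given mRNA. The function name and the example sequence are hypothetical and are not taken from this specification; the sketch models base pairing only, not the enzymology of reverse transcriptase.

```python
# Base-pairing rules for copying RNA into DNA (A pairs with T, U with A).
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna: str) -> str:
    """Return the first-strand cDNA, written 5' to 3', for an mRNA written 5' to 3'."""
    # The two strands are antiparallel, so the mRNA is read in reverse
    # in order to write the complementary strand in the 5' to 3' direction.
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))


if __name__ == "__main__":
    # Hypothetical nine-base message used only for illustration.
    print(first_strand_cdna("AUGGCCUAA"))
```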

The various types of genes include those which code for
polypeptides, those which are transcribed into RNA but are
not translated into polypeptides, and those whose functional
significance does not demand that they be transcribed at all.
Most genes are found on large molecules of DNA located in
chromosomes. Double stranded cDNA carries all the
information of a gene. Each base of the first strand is
joined to a complementary base (hybridized) in the second
strand. The linear DNA molecules in chromosomes have
thousands of genes distributed along their length.
Chromosomes include both coding regions (coding for
polypeptides) and noncoding regions; the coding regions
represent only about three percent of the total chromosome
sequence.

An individual gene has regulatory regions that include
a promoter which directs expression of the gene, a coding
region which can code for a polypeptide, and a termination
signal. The regulatory DNA sequence is usually a noncoding
region that determines if, where, when, and at what level a
particular gene is expressed.

The coding regions of many genes are discontinuous, with
coding sequences (exons) alternating with noncoding regions
(introns). The final mRNA copy of the gene does not include
these introns (which can be much longer than the coding
region itself), although it does contain certain untranslated
regions that usually do not code for the polynucleotide gene
product. Untranslated sequences at the beginning and end of
the mRNA are known as 5'- and 3'-untranslated regions,
respectively. This nomenclature reflects the orientation of
the nucleotide constituents of the mRNA.

A cDNA is a DNA copy of a messenger RNA, which
contains all of the exons of a gene. The cDNA can be thought
of as having three parts: an untranslated 5' leader, an
uninterrupted polypeptide-coding sequence, and a 3'
untranslated region. The untranslated leader and trailing
sequences are important for initiation of translation, mRNA
stability, and other functions. The untranslated leader and
trailing sequences are called 5'- and 3'-untranslated
sequences, respectively. The 3' untranslated sequence is
usually longer than the 5' untranslated leader, and can be
longer than the polypeptide-coding sequence. The
untranslated regions typically have many randomly
distributed stop codons, and do not display the nonrandom
base arrangements found in coding sequences. The 5'-
untranslated sequence is relatively short, generally between
20 and 200 bases. The 3'-untranslated sequence is often many
times longer, up to several thousand bases.

The translated or coding sequence begins with a
translational start codon (AUG or GUG) and ends with a
translational stop codon (UAA, UGA, or UAG). Generally,
translation begins at the first "start" codon on the mRNA and
proceeds to the first "stop" codon. Coding sequences can be
distinguished by their nonrandom distribution of bases;
numerous computer algorithms have been developed to
distinguish coding from noncoding regions in this way.
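
The start-to-stop rule just described can be sketched in a few lines. This is a deliberately simplified illustration, not one of the coding-region algorithms referred to above: it scans for the first start codon and reads in frame to the first stop codon. The function name and test sequence are hypothetical, and treating every GUG as a potential start is an assumption made only to mirror the codons listed in the text.

```python
from typing import Optional

START_CODONS = {"AUG", "GUG"}        # start codons named in the text
STOP_CODONS = {"UAA", "UGA", "UAG"}  # stop codons named in the text


def first_open_reading_frame(mrna: str) -> Optional[str]:
    """Return the stretch from the first start codon through the first
    in-frame stop codon, or None if no complete frame is found."""
    mrna = mrna.upper()
    for start in range(len(mrna) - 2):
        if mrna[start:start + 3] in START_CODONS:
            # Read codon by codon in the frame set by the start codon.
            for pos in range(start + 3, len(mrna) - 2, 3):
                if mrna[pos:pos + 3] in STOP_CODONS:
                    return mrna[start:pos + 3]
            return None  # the first start codon has no in-frame stop
    return None


if __name__ == "__main__":
    # Hypothetical message: 5' leader "CC", a three-codon frame, trailer "GG".
    print(first_open_reading_frame("CCAUGGCUUAAGG"))
```

A partial cDNA such as an EST will often return None here, because it may contain a start codon without its stop codon, or neither.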

Human DNA differs from person to person. No two persons
(except perhaps identical twins) have identical DNA. While
the differences, called allelic variations or polymorphisms,
are slight on a molecular level, they account for most of the
physical and other observable differences between
individuals. It has been estimated that approximately 14
million sequence polymorphism differences exist between
individuals.

The ability of one strand of DNA to attach or hybridize
to a complementary strand has already been exploited for
several purposes. For example, small pieces of DNA (15 to 25
base pairs long) can be made which will hybridize to longer
strands of DNA which have a complementary sequence. These
short "primers" can be selected such that they hybridize to
a specific, unique location on the longer strand. Once the
primers have hybridized to their target on the DNA, the
polymerase chain reaction (PCR) can be employed to generate
millions of copies of (or amplify) the particular segment of
DNA between the locations to which two primers are bound.
Briefly, this technique allows amplification of a DNA region
situated between two convergent primers, using
oligonucleotide primers that hybridize to opposite strands.
Primer extension proceeds inward across the region between
the two primers, and the product of DNA synthesis of one
primer serves as a template for the other primer. Repeated
cycles of DNA denaturation, annealing of primers, and
extension result in an exponential increase in the number of
copies of the region bounded by the primers.
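
The geometry of two convergent primers, and the exponential growth in copy number, can be illustrated with a short sketch. Everything named here (the functions, the template, and the six-base primers, which are far shorter than the 15 to 25 bases mentioned above) is hypothetical and chosen for brevity; the copy-number function assumes ideal doubling in every cycle, which real reactions only approximate.

```python
from typing import Optional

DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))


def amplicon(template: str, forward: str, reverse: str) -> Optional[str]:
    """Return the region bounded by two convergent primers, or None
    if either primer site is missing from the template strand."""
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so on the given
    # strand its site appears as the primer's reverse complement.
    site = reverse_complement(reverse)
    end = template.find(site, start + len(forward))
    if end == -1:
        return None
    return template[start:end + len(site)]


def copies_after(cycles: int, starting_copies: int = 1) -> int:
    """Idealized copy number, assuming perfect doubling in every cycle."""
    return starting_copies * 2 ** cycles


if __name__ == "__main__":
    print(amplicon("TTACGTACCGGATTCAGGCTAA", "ACGTAC", "AGCCTG"))
    print(copies_after(30))
```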

Similarly, a labeled segment of single-stranded DNA can
be hybridized to a longer DNA sequence, such as a chromosome,
to mark a specific location on the longer sequence. Segments
of DNA 50 bases long or longer that hybridize to a unique DNA
location in the human genome are extremely unlikely to
hybridize elsewhere in the human genome.

The Human Genome Project is an effort to sequence all
human DNA (the human genome). The human genome is estimated
to comprise 50,000 - 100,000 genes, up to 30,000 of which
might be expressed in the brain (Sutcliffe, Ann. Rev.
Neurosci. 11:157 (1988)). Once dedicated human chromosome
sequencing begins in three to five years, it is expected
that 12-15 years will be required to complete the sequence of
the genome (Report of the Ad Hoc Program Advisory Committee
on Complex Genomes, Reston, Va., Feb. 1988, D. Baltimore Ed.
(NIH, Bethesda, Md., 1988)). At that rate, the majority of
human genes would remain unknown for at least the next
decade. The present invention can greatly accelerate the
pace at which human genes can be identified and mapped. Most
gene researchers, in conjunction with publication of their
results in this field, submit sequence data to the GenBank
database. Prior to the present invention, GenBank listed the
sequences of only a few thousand human genes and fewer than
two hundred human brain mRNAs (GenBank Release 66.0,
December, 1990).

The role of sequencing complementary DNA (cDNA), reverse
transcribed from mRNA, as a part of the human genome project
has been vigorously debated since the idea of determining the
complete nucleotide sequence of humans first surfaced. The
coding sequence of all human genes represents most of the
information content of the genome, but only 3-5% of the total
DNA. In contrast, cDNA (which is only made from the
transcription product of active genes) is one-half to
three-fourths (the remainder being 5'- and 3'-untranslated
sequence) meaningful genetic information. Thus, some have
argued that cDNA sequencing should take precedence over
genomic sequencing (Brenner, CIBA Found. Symp. 149:6 (1990)).
However, until now, such arguments have not been heeded.

Genomic sequencing proponents have pointed to the difficulty
of finding every mRNA expressed in all tissues, cell types,
and developmental states, and have argued that much valuable
information from intronic and intergenic regions, including
control and regulatory sequences, will be missed by cDNA
sequencing (Report of the Committee on Mapping and Sequencing
the Human Genome, National Research Council (National Academy
Press, Washington, D.C. 1988)). Further, sequencing of
transcribed regions of the genome using cDNA libraries has
heretofore been considered impractical or unsatisfactory.
Libraries of cDNA were believed to be dominated by repetitive
elements, mitochondrial genes, ribosomal RNA genes, and other
nuclear genes comprising common or housekeeping sequences. It
was believed that cDNA libraries would provide few sequences
corresponding to structural and regulatory polypeptides or
peptides. See, for example, Putney et al., Nature 302:718-
721 (1983). Putney et al. sequenced over 150 clones from a
rabbit muscle cDNA library and identified clones for 13 of
the 19 known muscle polypeptides, including one new isotype
but no unknown coding sequences.

Another perceived drawback of cDNA sequencing was that
some mRNAs are abundant, and some are rare. The cellular
quantities of mRNA from various genes can vary by several
orders of magnitude. This led critics to believe that most
information obtained from cDNA sequencing would be
repetitious and useless.

The present invention demonstrates that, despite such
skepticism, cDNA sequencing now provides a rapid method for
obtaining enormous amounts of valuable genetic information
and DNA products of great utility for the biotechnology and
pharmaceutical industries. Not only can many distinct cDNAs
be isolated and sequenced, even partial cDNAs can be used,
with conventional, well-understood methods, to isolate entire
genes, and to determine the chromosomal locations and
biological functions of these genes. As is demonstrated
here, fragments of only a few hundred bases are sufficient,
in many cases, to identify the probable function of a new
human gene if it is similar in structure to a gene from
another animal, or from plants or bacteria. Similarly, even
fragments of untranslated regions of a cDNA can be used to:
i) isolate the coding sequence of the cDNA; ii) isolate the
complete gene; iii) determine the position of the gene on a
human chromosome, and hence the potential of the gene to
cause a human genetic disease; and iv) determine the function
of the gene by means of experiments in which the function of
the native gene is disrupted by the addition of a short DNA
fragment to the cell, e.g., using triple helix or antisense
probes.

Because coding regions comprise such a small portion of
the human genome, identification and mapping of transcribed
regions and coding regions of chromosomes is of significant
interest. There is a corresponding need for reagents for
identifying and marking coding regions and transcribed
regions of chromosomes. Furthermore, such human sequences are
valuable for chromosome mapping, human identification,
identification of tissue type and origin, forensic
identification, and locating disease-associated genes (i.e.,
genes that are associated with an inherited human disease,
whether through mutation, deletion, or faulty gene
expression) on the chromosome.

SUMMARY OF THE INVENTION 

Contrary to the expectations of the scientific
community, cDNA screening and sequencing techniques have now
been used to discover a large number of heretofore unknown
human genes. Disclosed herein are over 2,400 new human
polynucleotide sequences. These sequences could represent up
to 5% of all human genes. The novelty of these sequences has
been established through comparison to both nucleotide
sequence databases and amino acid sequence databases.
Surprisingly, over 80% of the sequences generated were
unrelated to any sequences previously described in the
literature.

The sequences of the present invention were ascertained
using a fast approach to cDNA characterization. This
approach could facilitate the tagging of most expressed human
genes within a few years at a fraction of the cost of
complete genomic sequencing, provide new genetic markers,
provide new DNA-based therapeutics and diagnostics, and
provide other valuable nucleotide reagents.

The sequences disclosed herein, styled Expressed
Sequence Tags ("ESTs"), are markers for human genes actually
transcribed in vivo. Techniques are disclosed for using
these ESTs to obtain the full coding region of the
corresponding gene. The use of ESTs, complete coding
sequences, or fragments thereof for marking chromosomes, for
mapping locations of expressed genes on chromosomes, for
individual or forensic identification, for mapping locations
of disease-associated genes, for identification of tissue
type, and for preparation of antisense sequences, probes, and
constructs is discussed in detail below. Unlike the random
genomic DNA sequence tagged sites (STSs) (Olson et al.,
Science 245:1434 (1989)), ESTs point directly to expressed
genes.

Various aspects of the present invention thus include
the individual ESTs, corresponding partial and complete cDNA,
genomic DNA, mRNA, antisense strands, triple helix probes,
PCR primers, coding regions, and constructs. Also, where one
skilled in the art is enabled by this specification to
prepare expression vectors and polypeptide expression
products, they are also within the scope of the present
invention, along with antibodies, especially monoclonal
antibodies, to such expression products.

DESCRIPTION OF THE DRAWING

The single drawing Figure schematically illustrates the
progression from chromosome to gene to mRNA to cDNA.



DETAILED DESCRIPTION OF THE INVENTION

The detailed description that follows provides not only
the actual sequence of each new EST, but also explains how
the ESTs were obtained, how to obtain the corresponding
complete cDNA sequence and the corresponding genomic DNA
sequence, how to make DNA constructs from the ESTs and
corresponding sequences, how to use those sequences as
reagents in molecular biology and other fields, how to
produce gene products from the ESTs and corresponding
sequences and antibodies to those gene products, and the
functional categories of many ESTs and corresponding genes.
Furthermore, numerous actual working examples and predictive
examples are provided to demonstrate and exemplify numerous
aspects of the invention.

I. ESTs from cDNA Libraries

The sequences of the present invention were isolated
from commercially available and custom made cDNA libraries
using a rapid screening and sequencing technique. In
general, the method comprises applying conventional automated
DNA sequencing technology to screening clones, advantageously
randomly selected clones, from a cDNA library. Preferably,
the library is initially "enriched" through removal of
ribosomal sequences and other common sequences prior to clone
selection. According to the present method, ESTs are
generated from partial DNA sequencing of the selected clones.
The ESTs of the present invention were generated using low
redundancy of sequencing, typically a single sequencing
reaction. While single sequencing reactions may have an
accuracy as low as 97%, this nevertheless provides sufficient
fidelity for identification of the sequence and design of PCR
primers.
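
To give a sense of scale for the 97% figure, the arithmetic below estimates how many bases of a single-pass read would be miscalled at that accuracy. This is a back-of-the-envelope sketch only: the function names are hypothetical, 97% is the low end quoted above rather than a typical value, and the model assumes that errors are independent and evenly spread along the read, which is not stated in this specification.

```python
def expected_miscalls(read_length: int, accuracy: float = 0.97) -> float:
    """Expected number of incorrect bases in a single-pass read."""
    return read_length * (1.0 - accuracy)


def chance_window_is_error_free(window_length: int, accuracy: float = 0.97) -> float:
    """Probability that a window of the given length has no miscalled base,
    assuming independent, uniformly distributed errors (an assumption)."""
    return accuracy ** window_length


if __name__ == "__main__":
    for length in (150, 400):
        print(length, round(expected_miscalls(length), 1))
    # A 20-base window is used here as a stand-in for a PCR primer site.
    print(round(chance_window_is_error_free(20), 3))
```

Under these assumptions a 400-base read at the worst-case accuracy carries about a dozen miscalls, which leaves the great majority of positions available for database comparison.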

Most human genes can be identified by EST sequencing
from libraries of cDNA copies of messenger RNAs. However,
some genes are expressed only at specific times during
embryonic development, or only in small amounts in a few
specific cell types. Other genes have mRNAs that are
degraded very quickly by the cell in which they are
expressed. If any of these are the case, transcripts of the
gene will not be represented in cDNA libraries so the gene
will not be identifiable by EST sequencing. A new method
called "exon amplification", however, can be used to isolate
and identify transcripts of such genes.

Exon amplification works by artificially expressing part
or all of a gene that is contained in a cloned fragment of
genomic DNA such as a cosmid or yeast artificial chromosome
(YAC). The gene is cloned into a special vector, designed at
MIT, that uses control elements from virus genes to express
the protein-coding exons of the human gene of interest. Exon
trapping shows considerable promise as a general technique
for identifying those genes in the human genome that cannot
be found by cDNA cloning and EST sequencing. Exon
amplification will also be useful for identifying the genes
in regions of genomic DNA to which disease genes have been
mapped. The exon amplification method can be used directly
with the cosmid and YAC clones from human chromosomes that
are being obtained by both NIH and DOE supported human genome
centers.

ESTs comprise DNA sequences corresponding to
a portion of nuclear encoded messenger RNA. An EST is of
sufficient length to permit: (1) amplification of the
specific sequence from a cDNA library, e.g., by polymerase
chain reaction (PCR); (2) use of a synthetic polynucleotide,
generally having 30 - 50 base pairs and corresponding to a
partial or complete sequence of the EST, as a hybridization
probe of a cDNA library; or (3) unique designation of the
pure cDNA clone from which the EST was derived (the EST
clone) for use as a hybridization probe of a cDNA library.
Preferably, EST-derived primer pairs and sequences amplify or
detectably hybridize to a sequence from a genomic library.

It has been found that sufficient information is
contained in the 150-400 base ESTs from one sequencing run to
effect preliminary identification and exact chromosome
mapping. Accordingly, the ESTs disclosed herein are generally
at least 150 base pairs in length. The length of an EST is
determined by the quality of sequencing data and the length
of the cloned cDNA. Raw data from the automated sequencers
is edited to remove low quality sequence at the end of the
sequencing run. High quality sequences (usually a result of
sequencing templates without excessive salt contamination)
generally give about 400 bp of reliable sequence data; other
sequences give fewer bases of reliable data. A 150 bp EST is
long enough to be translated into a 50 amino acid peptide
sequence. This length is sufficient to observe similarities
when they exist in a database search. Furthermore, 150 bp is
long enough to design PCR primers from each end of the
sequence to amplify the complete EST. Sequences shorter than
150 bp are difficult to purify and use following PCR
amplification. Furthermore, a 150 bp polynucleotide is
likely to give a very strong signal with low background in a
screen of a genomic library.
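
The editing and length criteria in the preceding paragraph can be sketched as follows. This is an illustrative assumption about how such editing might be automated, not the procedure used to generate the disclosed ESTs: the per-base quality scores, the threshold of 20, and all function names are hypothetical. Only the 150 bp minimum and the three-bases-per-amino-acid arithmetic come from the text.

```python
from typing import Sequence


def trim_read(bases: str, qualities: Sequence[int], min_quality: int = 20) -> str:
    """Remove the low-quality tail of a read by cutting after the last base
    whose (hypothetical) quality score meets the threshold."""
    keep = 0
    for index, quality in enumerate(qualities):
        if quality >= min_quality:
            keep = index + 1
    return bases[:keep]


def is_usable_est(read: str, min_length: int = 150) -> bool:
    """Apply the 150 bp minimum length described in the text."""
    return len(read) >= min_length


def max_peptide_length(est_length: int) -> int:
    """Longest peptide a fully coding EST of this length could specify."""
    return est_length // 3


if __name__ == "__main__":
    print(trim_read("ACGTACGT", [30, 30, 25, 22, 21, 10, 5, 2]))
    print(max_peptide_length(150))
```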

Finally, it is highly unlikely that a sequence of the
same 150 bp exists in any genes in the genome besides the one
tagged by the EST. Some closely related gene family members
have very similar nucleotide sequences, but no examples of
pairs of human genes with long segments of identical sequence
have been reported to date. For instance, there are three
known β-tubulin genes in humans. Several ESTs were found
that matched one or another of these tubulin genes, but
several new members of this gene family were also found and
could be clearly distinguished from the three known members.
ESTs that match perfectly to several different genes can be
detected by hybridizing to chromosomes: if many chromosomal
loci are observed, the sequence (or a close variant) is
present in more than one gene. This problem can be
circumvented by using the 3'-untranslated part of the cDNA
alone as a probe for the chromosomal location or for the
full-length cDNA or gene. The 3'-untranslated region is more
likely to be unique within gene families, since there is no
evolutionary pressure to conserve a coding function of this
region of the mRNA.

As demonstrated in the Examples that follow, ESTs can be
used to map the expressed sequence to a particular
chromosome. In addition, ESTs can be expanded to provide the
full coding regions, as detailed below. In this manner,
previously unknown genes can be identified.

While a variety of cDNA libraries can be used to obtain
ESTs, human brain cDNA libraries are exemplified and
represent a preferred embodiment. Suitable cDNA libraries
can be freshly prepared or obtained commercially, e.g., as
shown in Examples 1, 2, and 11. The cDNA libraries from the
desired tissue are preferably preprocessed by conventional
techniques to reduce repeated sequencing of high and
intermediate abundance clones and to maximize the chances of
finding rare messages from specific cell populations.
Preferably, preprocessing includes the use of defined
composition prescreening probes, e.g., cDNA corresponding to
mitochondria, abundant sequences, ribosomes, actins, myelin
basic polypeptides, or any other known high abundance
peptide; these prescreening probes used for preprocessing are
generally derived from known ESTs. Other useful
preprocessing techniques include subtraction, which
preferentially reduces the population of certain sequences in
the library (e.g., see A. Swaroop et al., Nucl. Acids Res.
19:1954 (1991)), and normalization, which results in all
sequences being represented in approximately equal
proportions in the library (Patanjali et al., Proc. Natl.
Acad. Sci. USA 88:1943 (1991)).

The cDNA libraries used in the present method will
ideally use directional cloning methods so that either the 5'
end of the cDNA (likely to contain coding sequence) or the 3'
end (likely to be a non-coding sequence) can be selectively
obtained.

Libraries of cDNA can also be generated from recombinant
expression of genomic DNA. After they are amplified, ESTs
can be obtained and sequenced, e.g., as illustrated in
Example 11.

The sequences of the present invention include the
specific sequences set forth in the Sequence Listing and
designated SEQ ID NO: 1 - SEQ ID NO: 2412. In one aspect of
this embodiment, the invention relates to those sequences of
SEQ ID NOS: 1 - 2412 that comprise the cDNA coding sequences
for polypeptides having less than 95% identity with known
amino acid sequences (see Table 2) and more preferably less
than 90% or 85% identity. In a second aspect, the invention
relates to those sequences of SEQ ID NOS: 1 - 2412 that
encode polypeptides having no similarity to known amino acid
sequences (see Examples that follow). Precisely because they
do not contain coding regions and are therefore more unique
in their sequence structures, those sequences which meet
neither of the preceding criteria can be most useful and are
generally preferred for mapping.

Consistent with the NIH mission and its responsibilities
to disseminate knowledge and share the tangible fruits of its
research, the present inventors have taken a number of steps
to facilitate sequence data and clone availability. All EST
sequences have been submitted to GenBank (representing an
addition equivalent to 7% of the human nucleotides in Release
69 of GenBank, September 1991). The corresponding cDNA
clones have been submitted to the American Type Culture
Collection and information on clones and sequences has been
submitted to the Genome Data Base (Pearson, P., Nucl. Acids
Res. 19 (Suppl.):2237-9 (1991)).

II. Complete Coding Sequences from ESTs

The ESTs of the present invention generally represent
relatively small coding regions or untranslated regions of
human genes. Although most of these sequences do not code
for a complete gene product, the ESTs of the present
invention are highly specific markers for the corresponding
complete coding regions. The ESTs are of sufficient length
that they will hybridize, under stringent conditions, only
with DNA for that gene to which they correspond. Suitably
stringent conditions comprise conditions, for example, where
at least 95%, preferably at least 97% or 98% identity (base
pairing), is required for hybridization. This property
permits use of the EST to isolate the entire coding region
and even the entire sequence. Therefore, only routine
laboratory work is necessary to parlay the unique EST
sequence into the corresponding unique complete gene
sequence.

Thus, each of the ESTs of the present invention
"corresponds" to a particular unique human gene. Knowledge
of the EST sequence permits routine isolation and sequencing
of the complete coding sequence of the corresponding gene.
The complete coding sequence is present in a full-length cDNA
clone as well as in the gene carried on genomic clones.
Therefore, each EST "corresponds" to a cDNA (from which the
EST was derived), a complete genomic gene sequence, a
polypeptide coding region (which can be obtained either from
the cDNA or genomic DNA), and a polypeptide or amino acid
sequence encoded by that region.

The first step in determining where an EST is located in
the cDNA is to analyze the EST for the presence of coding
sequence, e.g., as described in Example 14. The CRM program
predicts the extent and orientation of the coding region of
a sequence. Based on this information, one can infer the
presence of start or stop codons within a sequence and
whether the sequence is completely coding or completely non-
coding. If start or stop codons are present, then the EST
can cover both part of the 5'-untranslated or 3'-untranslated
part of the mRNA (respectively) as well as part of the coding
sequence. If no coding sequence is present, it is likely
that the EST is derived from the 3'-untranslated sequence due
to its longer length and the fact that most cDNA library
construction methods are biased toward the 3' end of the
mRNA.

One general procedure for obtaining complete sequences
from ESTs is as follows:

1. Purify selected human DNA from an EST clone (the
cDNA clone that was sequenced to give the EST), e.g., by
endonuclease digestion using EcoRI, gel electrophoresis, and
isolation of the aforementioned clone by removal from low-
melting agarose gel.

2. Radiolabel the isolated insert DNA, e.g., with
labels, preferably by nick translation or random primer
labeling.

3. Use the labeled EST insert as a probe to screen a
lambda phage cDNA library or a plasmid cDNA library.

4. Identify colonies containing clones related to the
probe cDNA and purify them by known purification methods.

5. Nucleotide sequence the ends of the newly purified
clones to identify full length sequences.

6. Perform complete sequencing of full length clones
by Exonuclease III digestion or primer walking. Northern
blots of the mRNA from various tissues using at least part of
the EST clone as a probe can optionally be performed to check
the size of the mRNA against that of the purported full
length cDNA.

An EST is a specific tag for a messenger RNA molecule.
The complete sequence of that messenger RNA, in the form of
cDNA, can be determined using the EST as a probe to identify
a cDNA clone corresponding to a full-length transcript,
followed by sequencing of that clone. The EST or the full-
length cDNA clone can also be used as a probe to identify a
genomic clone or clones that contain the complete gene
including regulatory and promoter regions, exons, and
introns.

ESTs are used as probes to identify the cDNA clones from
which an EST was derived. ESTs, or portions thereof, can be
nick-translated or end-labelled with 32P using polynucleotide
kinase, using labelling methods known to those with skill in
the art (Basic Methods in Molecular Biology, L.G. Davis, M.D.
Dibner, and J.F. Battey, ed., Elsevier Press, NY, 1986). The
lambda library can be directly screened with the labelled
ESTs of interest or the library can be converted en masse to
pBluescript (Stratagene, La Jolla, California) to facilitate
bacterial colony screening. Both methods are well known in
the art. Briefly, filters with bacterial colonies containing
the library in pBluescript or bacterial lawns containing
lambda plaques are denatured and the DNA is fixed to the
filters. The filters are hybridized with the labelled probe
using hybridization conditions described by Davis et al. The
ESTs, cloned into lambda or pBluescript, can be used as
positive controls to assess background binding and to adjust
the hybridization and washing stringencies necessary for
accurate clone identification. The resulting autoradiograms
are compared to duplicate plates of colonies or plaques; each
exposed spot corresponds to a positive colony or plaque. The
colonies or plaques are selected and expanded, and the DNA is
isolated from the colonies for further analysis and
sequencing.

The ESTs can additionally be used to screen Northern
blots of mRNA obtained from various tissues or cell cultures,
including the tissue of origin of the EST clone. Northern
analysis will most often produce one to several positive
bands. The bands can be selected for further study based on
the predicted size of the mRNA.

Positive cDNA clones in phage lambda are analyzed to
determine the amount of additional sequence they contain
using PCR with one primer from the EST and the other primer
from the vector. Clones with a larger vector-insert PCR
product than the original EST clone are analyzed by
restriction digestion and DNA sequencing to determine whether
they contain an insert of the same or similar size as the
mRNA on a Northern blot.
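
The size comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the applicants' procedure: the clone names, the product sizes in base pairs, and the function name `clones_with_additional_sequence` are all hypothetical.

```python
def clones_with_additional_sequence(est_product_bp, clone_products_bp):
    """Return the names of clones whose vector-insert PCR product is larger
    than that of the original EST clone, largest product first."""
    longer = {name: size for name, size in clone_products_bp.items()
              if size > est_product_bp}
    return sorted(longer, key=longer.get, reverse=True)


# Hypothetical product sizes (bp) estimated from a gel.
products = {"clone_A": 450, "clone_B": 1800, "clone_C": 2600}
print(clones_with_additional_sequence(600, products))  # ['clone_C', 'clone_B']
```

Clones returned by such a filter would still need the restriction digestion and sequencing described above before being compared with the Northern blot size.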

Once one or more overlapping cDNA clones are identified,
the complete sequence of the clones can be determined. The
preferred method is to use exonuclease III digestion
(McCombie, W.R., Kirkness, E., Fleming, J.T., Kerlavage, A.R.,
Iovannisci, D.M., and Martin-Gallardo, R., Methods 3: 33-40,
1991). A series of deletion clones is generated, each of
which is sequenced. The resulting overlapping sequences are
assembled into a single contiguous sequence of high
redundancy (usually three to five overlapping sequences at
each nucleotide position), resulting in a highly accurate
final sequence.
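
To illustrate why redundancy improves accuracy, the following sketch takes reads that have already been placed on a common coordinate system and derives a majority-vote consensus together with the per-position coverage. The read offsets, the toy sequences, and the function name `consensus` are assumptions made for this example; the specification does not prescribe a particular assembly algorithm.

```python
from collections import Counter


def consensus(reads):
    """reads: list of (start_offset, sequence) pairs on shared coordinates.
    Returns (consensus_string, coverage_per_position)."""
    length = max(start + len(seq) for start, seq in reads)
    columns = [Counter() for _ in range(length)]
    for start, seq in reads:
        for i, base in enumerate(seq):
            columns[start + i][base] += 1
    coverage = [sum(col.values()) for col in columns]
    # Majority vote at each position; "N" marks positions with no read.
    calls = [col.most_common(1)[0][0] if col else "N" for col in columns]
    return "".join(calls), coverage


# The third read carries a single miscalled base (T instead of A at position 4).
reads = [(0, "ACGTAC"), (2, "GTACGG"), (3, "TTCGGA")]
sequence, depth = consensus(reads)
print(sequence)  # ACGTACGGA
print(depth)     # [1, 1, 2, 3, 3, 3, 2, 2, 1]
```

With three reads covering position 4, the single miscall is outvoted, which is the benefit of three- to five-fold redundancy.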

A similar screening and clone selection approach can be
applied to obtaining cosmid or lambda clones from a genomic
DNA library that contains the complete gene from which the
EST was derived (Kirkness, E.F., Kusiak, J.W., Menninger, J.,
Gocayne, J.D., Ward, D.C., and Venter, J.C. Genomics 10: 985-
995 (1991)). Although the process is much more laborious,
these genomic clones can be sequenced in their entirety also.
A shotgun approach is preferred for sequencing clones with
inserts longer than 10 kb (genomic cosmid and lambda clones).
In shotgun sequencing, the clone is randomly broken into many
small pieces, each of which is partially sequenced. The
sequence fragments are then aligned to produce the final
contiguous sequence with high redundancy. An intermediate
approach is to sequence just the promoter region and the
intron-exon boundaries and to estimate the size of the
introns by restriction endonuclease digestion (ibid.).
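
The alignment of shotgun fragments into one contiguous sequence can be illustrated with a toy greedy overlap merge. This is a deliberately simplified sketch: it assumes error-free fragments that all come from the same strand, and the function names and the minimum overlap of 3 bases are choices made for the example, not part of the specification.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b
    (at least min_len bases), or 0 if there is none."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left; remaining contigs stay separate
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


pieces = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
print(greedy_assemble(pieces))  # ['ATGGCGTACGTTAGC']
```

Real assemblies must also handle sequencing errors, reverse-strand reads, and repeats, which this sketch ignores.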

Using the sequence information provided herein, the
polynucleotides of the present invention can be derived from
natural sources or synthesized, using known methods. The
sequences falling within the scope of the present invention
are not limited to the specific sequences described, but
include human allelic and species variations thereof and
portions thereof of at least 15-18 bases. (Sequences of at
least 15-18 bases can be used, for example, as PCR primers or
as DNA probes.) In addition, the invention includes the
entire coding sequence associated with the specific
polynucleotide sequence of bases described in the Sequence
Listing, as well as portions of the entire coding sequence of
at least 15-18 bases and allelic and species variations
thereof. Furthermore, to accommodate codon variability, the
invention includes sequences coding for the same amino acid
sequences as do the specific sequences disclosed herein.
Finally, although the error rate in the automated sequencing
used in the present invention is small, there remains some
chance of error. Therefore, claims to particular sequences
should not be so narrowly construed as to require inclusion
of erroneously identified bases or to exclude corrections.
Any specific sequence disclosed herein can be readily
screened for errors by resequencing each EST in both
directions (i.e., sequence both strands of cDNA).
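
The codon variability mentioned above means that different nucleotide sequences can encode the same amino acid sequence. The sketch below checks this for two short sequences using the standard genetic code. The function names and the example sequences are hypothetical, and the sketch assumes the reading frame starts at the first base.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA coding sequence in frame 1; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def encode_same_polypeptide(seq_a, seq_b):
    """True if both sequences translate to the same amino acid sequence."""
    return translate(seq_a) == translate(seq_b)


print(translate("ATGGCTAAA"))                             # MAK
print(encode_same_polypeptide("ATGGCTAAA", "ATGGCCAAG"))  # True
print(encode_same_polypeptide("ATGGCTAAA", "ATGGATAAA"))  # False
```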



WO 93/16178 



PCT/US93/01294 




The sequences, constructs, vectors, clones, and other
materials comprising the present invention can advantageously
be in enriched or isolated form. As used herein, "enriched"
means that the concentration of the material is at least
about 2, 5, 10, 100, or 1000 times its natural concentration
(for example), advantageously 0.01% by weight, preferably at
least about 0.1% by weight. Enriched preparations of about
0.5%, 1%, 5%, 10%, and 20% by weight are also contemplated.
Further, removal of clones corresponding to ribosomal RNA and
"housekeeping" genes and clones without human cDNA inserts
results in a library that is "enriched" in the desired
clones.

The term "isolated" requires that the material be
removed from its original environment (e.g., the natural
environment if it is naturally occurring). For example, a
naturally-occurring polynucleotide present in a living animal
is not isolated, but the same polynucleotide, separated from
some or all of the coexisting materials in the natural
system, is isolated.

It is also advantageous that the sequences be in
purified form. The term "purified" does not require absolute
purity; rather, it is intended as a relative definition.
Individual EST clones isolated from a cDNA library have been
conventionally purified to electrophoretic homogeneity. The
sequences obtained from these clones could not be obtained
directly either from the library or from total human DNA.
The cDNA clones are not naturally occurring as such, but
rather are obtained via manipulation of a partially purified
naturally occurring substance (messenger RNA). The
conversion of mRNA into a cDNA library involves the creation
of a synthetic substance (cDNA) and pure individual cDNA
clones can be isolated from the synthetic library by clonal
selection. Thus, creating a cDNA library from messenger RNA
and subsequently isolating individual clones from that
library results in an approximately 10^6-fold purification of
the native message. Purification of starting material or
natural material to at least one order of magnitude,
preferably two or three orders, and more preferably four or
five orders of magnitude is expressly contemplated.

In a cDNA library there are many species of mRNA
represented. Each cDNA clone can be interesting in its own
right, but must be isolated from the library before further
experimentation can be completed. In order to sequence any
specific cDNA, it must be removed and separated (i.e.,
isolated and purified) from all the other sequences. This
can be accomplished by many techniques known to those of
skill in the art. These procedures normally involve
identification of a bacterial colony containing the cDNA of
interest and further amplification of that bacterium. Once a
cDNA is separated from the mixed clone library, it can be
used as a template for further procedures such as nucleotide
sequencing.

Although claims to large numbers of ESTs and
corresponding sequences are presented herein, the invention
is not limited to these particular groupings of sequences.
Thus, individual sequences are considered as applicants'
discoveries or inventions, as are subgroupings of sequences.
All of the functional subgroupings set forth in the tables
define groupings for which separate claims are contemplated
as being within the scope of this invention. Moreover, in
addition to claims to individual clones, it is intended that
the present disclosure also support claims to numerical
subgroupings. Thus, subgroupings of 50 ESTs (and
corresponding sequences) are contemplated (e.g., SEQ ID NOS
1-50, 51-100, 101-150, etc.) as being within the scope of
this invention, as are subgroupings of 5, 10, 25, 100, 200,
and 500 ESTs and corresponding sequences.

III. DNA Constructs 

The present invention also includes recombinant
constructs comprising one or more of the sequences as broadly
described above. The constructs comprise a vector, such as
a plasmid or viral vector, into which a sequence of the
invention has been inserted, in a sense or antisense
orientation. In a preferred aspect of this embodiment, the
construct further comprises regulatory sequences, including,
for example, a promoter, operably linked to the sequence.
Large numbers of suitable vectors and promoters are known to
those of skill in the art, and are commercially available.
The following vectors are provided by way of example.
Bacterial: pBs, phagescript, φX174, pBluescript SK, pBs KS,
pNH8a, pNH16a, pNH18a, pNH46a (Stratagene); pTrc99A,
pKK223-3, pKK233-3, pDR540, pRIT5 (Pharmacia).
Eukaryotic: pWLneo, pSV2cat, pOG44, pXT1, pSG (Stratagene);
pSVK3, pBPV, pMSG, pSVL (Pharmacia).

Promoter regions can be selected from any desired gene
using CAT (chloramphenicol acetyltransferase) vectors or other
vectors with selectable markers. Two appropriate vectors are
pKK232-8 and pCM7. Particular named bacterial promoters
include lacI, lacZ, T3, T7, gpt, lambda PR, and trc.
Eukaryotic promoters include CMV immediate early, HSV
thymidine kinase, early and late SV40, LTRs from retrovirus,
and mouse metallothionein-I. Selection of the appropriate
vector and promoter is well within the level of ordinary
skill in the art.

In a further embodiment, the present invention relates
to host cells containing the above-described construct. The
host cell can be a higher eukaryotic cell, such as a
mammalian cell, or a lower eukaryotic cell, such as a yeast
cell, or the host cell can be a procaryotic cell, such as a
bacterial cell. Introduction of the construct into the host
cell can be effected by calcium phosphate transfection, DEAE
dextran mediated transfection, or electroporation (Davis, L.,
Dibner, M., Battey, J., Basic Methods in Molecular Biology,
(1986)).

The constructs in host cells can be used in a
conventional manner to produce the gene product coded by the
recombinant sequence. Alternatively, the encoded polypeptide
can be synthetically produced by conventional peptide
synthesizers.

Certain ESTs have already been preliminarily categorized
by analogy to related sequences in other organisms (see Table
2). Table 10 of Example 10 categorizes particular ESTs
broadly as metabolic, regulatory, and structural sequences
where known. Constructs comprising genes or coding sequences
corresponding to each of these categories are, therefore,
specifically and individually contemplated.

Table 11 more particularly separates 127 new ESTs into
13 categories using different criteria. These are genes
related to cell surface; developmental control; energy
metabolism; kinase and phosphatase; oncogenes; other
metabolism-related polypeptides; peptidases and peptidase
inhibitors; receptors; structural and cytoskeletal; signal
transduction; transporters; transcription, translation, and
subcellular localization; and transcription factors. Table
11 further identifies the EST by the particular gene product
for which it apparently codes. Each of these categories
individually comprises a preferred category of EST, and
preferred constructs and resulting polypeptides can be
prepared from those ESTs or the corresponding complete gene
sequence.

IV. ESTs and Corresponding Sequences as Reagents

Each of the cDNA sequences identified herein (and the
corresponding complete gene sequences) can be used in
numerous ways as polynucleotide reagents. The sequences can
be used as diagnostic probes for the presence of a specific
mRNA in a particular cell type. In addition, these sequences
can be used as diagnostic probes suitable for use in genetic
linkage analysis (polymorphisms). Further, the sequences can
be used as probes for locating gene regions associated with
genetic disease, as explained in more detail below.

The EST and complete gene sequences of the present
invention are also valuable for chromosome identification.
Each sequence is specifically targeted to and can hybridize
with a particular location on an individual human chromosome.
Moreover, there is a current need for identifying particular
sites on the chromosome. Few chromosome marking reagents
based on actual sequence data (repeat polymorphisms) are
presently available for marking chromosomal location. The
present invention constitutes a major expansion of available
chromosome markers. One hundred ESTs have already been
mapped to chromosomes. Using the techniques described in
Example 5 or 6, the remaining ESTs and the corresponding
complete sequences can similarly be mapped to chromosomes.
The mapping of ESTs and cDNAs to chromosomes according to the
present invention is an important first step in correlating
those sequences with genes associated with disease.

Briefly, sequences can be mapped to chromosomes by
preparing PCR primers (preferably 15-25 bp) from the ESTs.
Computer analysis of the ESTs is used to rapidly select
primers that do not span more than one exon in the genomic
DNA, since primers spanning more than one exon would
complicate the amplification process. These
primers are then used for PCR screening of somatic cell
hybrids containing individual human chromosomes. Only those
hybrids containing the human gene corresponding to the EST
will yield an amplified fragment.
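
The logic of assigning an EST to a chromosome from a hybrid panel can be sketched as follows. The panel composition, the hybrid names, and the function name `assign_chromosome` are hypothetical; the sketch also allows each hybrid to retain more than one human chromosome, which is a generalization beyond the single-chromosome hybrids described above.

```python
def assign_chromosome(panel, pcr_positive):
    """panel: dict mapping hybrid name -> set of human chromosomes retained.
    pcr_positive: set of hybrid names that yielded an amplified fragment.
    Returns the chromosomes whose presence or absence agrees with the PCR
    result in every hybrid of the panel."""
    all_chroms = set().union(*panel.values())
    return sorted(
        chrom for chrom in all_chroms
        if all((chrom in retained) == (name in pcr_positive)
               for name, retained in panel.items())
    )


# Hypothetical three-hybrid panel and PCR results.
panel = {"H1": {"1", "7"}, "H2": {"7", "X"}, "H3": {"1", "X"}}
print(assign_chromosome(panel, {"H1", "H2"}))  # ['7']
```

An empty result would indicate discordant data, and more than one chromosome in the result would mean the panel cannot distinguish between them.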

PCR mapping of somatic cell hybrids is a rapid procedure
for assigning a particular EST to a particular chromosome.
Three or more clones can be assigned per day using a single
thermal cycler. Using the present invention with the same
oligonucleotide primers, sublocalization can be achieved with
panels of fragments from specific chromosomes or pools of
large genomic clones in an analogous manner. Other mapping
strategies that can similarly be used to map an EST to its
chromosome include in situ hybridization, prescreening with
labeled flow-sorted chromosomes and preselection by
hybridization to construct chromosome specific cDNA
libraries. Results of mapping ESTs to chromosomal segments
are listed in Tables 3 and 4.

Fluorescence in situ hybridization (FISH) of a cDNA
clone to a metaphase chromosomal spread can be used to
provide a precise chromosomal location in one step. This
technique can be used with cDNA as short as 500 or 600 bases;
however, clones larger than 2,000 bp have a higher likelihood
of binding to a unique chromosomal location with sufficient
signal intensity for simple detection. FISH requires use of
the clone from which the EST was derived, and the longer the
better: 2,000 bp is good, 4,000 is better, and more than
4,000 is probably not necessary to get good results a
reasonable percentage of the time. For a review of this
technique, see Verma et al., Human Chromosomes: a Manual of
Basic Techniques, Pergamon Press, New York (1988).

Reagents for chromosome mapping can be used individually
(to mark a single chromosome or a single site on that
chromosome) or as panels of reagents (for marking multiple
sites and/or multiple chromosomes). Reagents corresponding
to noncoding regions of the genes actually are preferred for
mapping purposes. Coding sequences are more likely to be
conserved within gene families, thus increasing the chance of
cross hybridizations during chromosomal mapping (see Tables
8 and 9).

Once a sequence has been mapped to a precise chromosomal
location, the physical position of the sequence on the
chromosome can be correlated with genetic map data. (Such
data are found, for example, in V. McKusick, Mendelian
Inheritance in Man (available on line through Johns Hopkins
University Welch Medical Library).) The relationship between
genes and diseases that have been mapped to the same
chromosomal region is then identified through linkage
analysis (coinheritance of physically adjacent genes).

Next, it is necessary to determine the differences in
the cDNA or genomic sequence between affected and unaffected
individuals. If a mutation is observed in some or all of the
affected individuals but not in any normal individuals, then
the mutation is likely to be the causative agent of the
disease.

With current resolution of physical mapping and genetic
mapping techniques, a cDNA precisely localized to a
chromosomal region associated with the disease could be one
of between 50 and 500 potential causative genes. (This
assumes 1 megabase mapping resolution and one gene per 20
kb.)
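
The lower figure follows directly from the stated assumptions, as the short calculation below shows. The function name `candidate_genes` is hypothetical. Under the same gene density a 10-megabase region would contain 500 genes; whether a coarser resolution of that kind is the basis for the upper figure is an assumption of this sketch, not something the text states.

```python
def candidate_genes(region_bp, bp_per_gene=20_000):
    """Number of genes expected in a mapped region at a uniform density."""
    return region_bp // bp_per_gene


print(candidate_genes(1_000_000))   # 50
print(candidate_genes(10_000_000))  # 500
```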

Comparison of affected and unaffected individuals
generally involves first looking for structural alterations
in the chromosomes, such as deletions or translocations that
are visible from chromosome spreads or detectable using PCR
based on that cDNA sequence. Ultimately, complete sequencing
of genes from several individuals is required to confirm the
presence of a mutation and to distinguish mutations from
polymorphisms.

In addition to the foregoing, the sequences of the
invention, as broadly described, can be used to control gene
expression through triple helix formation or antisense DNA or
RNA, both of which methods are based on binding of a
polynucleotide sequence to DNA or RNA. Polynucleotides
suitable for use in these methods are usually 20 to 40 bases
in length and are designed to be complementary to a region of
the gene involved in transcription (triple helix - see Lee et
al, Nucl. Acids Res. 6: 3073 (1979); Cooney et al, Science
241: 456 (1988); and Dervan et al, Science 251: 1360 (1991))
or to the mRNA itself (antisense - Okano, J. Neurochem. 56:
560 (1991); Oligodeoxynucleotides as Antisense Inhibitors of
Gene Expression, CRC Press, Boca Raton, FL (1988)). Triple
helix formation optimally results in a shut-off of RNA
transcription from DNA, while antisense RNA hybridization
blocks translation of an mRNA molecule into polypeptide.
Both techniques have been demonstrated to be efficient in
model systems. Information contained in the sequences of the
present invention is necessary for the design of an antisense
or triple helix oligonucleotide.
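
As a simple illustration of the antisense case, the sketch below derives an RNA oligonucleotide complementary to a chosen window of an mRNA. The function name `antisense_rna`, the example mRNA, and the choice of window are hypothetical; only the 20 to 40 base length range comes from the text, and practical design criteria such as target accessibility are not modeled.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def antisense_rna(mrna, start, length=20):
    """Return the antisense RNA (written 5' to 3') complementary to
    mrna[start:start + length]."""
    if not 20 <= length <= 40:
        raise ValueError("length should be 20 to 40 bases")
    target = mrna[start:start + length].upper()
    if len(target) < length:
        raise ValueError("window runs past the end of the mRNA")
    # Complement each base, then reverse so the result reads 5' to 3'.
    return target.translate(COMPLEMENT)[::-1]


mrna = "AUGGCUAGCUAGGCUUACGAUCGAUCGGAU"  # hypothetical 30-base message
oligo = antisense_rna(mrna, 0, 20)
print(oligo)
```

Reading the oligonucleotide backwards, each base pairs with the corresponding base of the target window (A with U, G with C).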

The present invention is also a useful tool in gene
therapy, which requires isolation of the disease-associated
gene in question as a prerequisite to the insertion of a
normal gene into an organism to correct a genetic defect. The
high specificity of the cDNA probes according to this
invention holds promise for targeting such gene locations in a
highly accurate manner.

The sequences of the present invention, as broadly
defined, are also useful for identification of individuals
from minute biological samples. The United States military,
for example, is considering the use of restriction fragment
length polymorphism (RFLP) for identification of its
personnel. In this technique, an individual's genomic DNA is
digested with one or more restriction enzymes, and probed on
a Southern blot to yield unique bands for identifying
personnel. This method does not suffer from the current
limitations of "Dog Tags" which can be lost, switched, or
stolen, making positive identification difficult. The
sequences of the present invention are useful as additional
DNA markers for RFLP.
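
The principle behind RFLP can be illustrated by digesting two alleles in silico and comparing the fragment lengths. The sketch uses the EcoRI recognition site (GAATTC, cut after the first G) on a linear molecule; the allele sequences and the function name `fragment_lengths` are invented for the example.

```python
def fragment_lengths(dna, site="GAATTC", cut_offset=1):
    """Lengths of the fragments produced by cutting a linear DNA molecule
    cut_offset bases into every occurrence of site, longest first."""
    dna = dna.upper()
    cuts = []
    pos = dna.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = dna.find(site, pos + 1)
    bounds = [0] + cuts + [len(dna)]
    return sorted((b - a for a, b in zip(bounds, bounds[1:])), reverse=True)


# Two hypothetical alleles; the second has lost one EcoRI site (A -> G).
allele_1 = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TT"
allele_2 = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAGTTC" + "TT"
print(fragment_lengths(allele_1))  # [14, 7, 5]
print(fragment_lengths(allele_2))  # [21, 5]
```

A single base difference in a recognition site changes the band pattern, which is what a Southern blot probe reveals.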

However, RFLP is a pattern based technique, which does
not directly focus on the actual DNA sequence of the
individual. The sequences of the present invention can be
used to provide an alternative technique that determines the
actual base-by-base DNA sequence of selected portions of an
individual's genome. These sequences can be used to prepare
PCR primers for amplifying and isolating such selected DNA.
One can, for example, take an EST of the invention and
prepare two PCR primers from the 5' and 3' ends of the EST.
These are used to amplify an individual's DNA corresponding
to the EST. The amplified DNA is sequenced.
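
Preparing a primer pair from the two ends of an EST can be sketched as follows. The EST sequence and the function names are hypothetical, and the 18-base primer length is simply one value within the 15-25 base range mentioned elsewhere in this specification; melting temperature, secondary structure, and similar design criteria are not considered.

```python
def reverse_complement(dna):
    """Reverse complement of a DNA sequence."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def primer_pair(est, primer_len=18):
    """Return (forward, reverse) primers taken from the 5' and 3' ends of
    the EST; the reverse primer is the reverse complement of the 3' end."""
    if len(est) < 2 * primer_len:
        raise ValueError("EST is too short for two non-overlapping primers")
    forward = est[:primer_len].upper()
    reverse = reverse_complement(est[-primer_len:])
    return forward, reverse


est = "ATGGCTAGCTAGGCTTACGATCGATCGGATCCGTAAGCTTGGA"  # hypothetical EST
forward, reverse = primer_pair(est)
print(forward)
print(reverse)
```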

Panels of corresponding DNA sequences from individuals,
made this way, can provide unique individual identifications,
as each individual will have a unique set of such DNA
sequences, due to allelic differences. The sequences of the
present invention can be used to particular advantage to
obtain such identification sequences from individuals and
from tissue, as explained in Examples 12-14.

The EST sequences from Examples 1 and 2 and the complete
sequences from Example 13 uniquely represent portions of the
human genome. Allelic variation occurs to some degree in the
coding regions of these sequences, and to a greater degree in
the noncoding regions. It is estimated that allelic
variation between individual humans occurs with a frequency
of about once per each 500 bases. Each of the ESTs or
complete coding sequences comprising a part of the present
invention can, to some degree, be used as a standard against
which DNA from an individual can be compared for
identification purposes. Because greater numbers of
polymorphisms occur in the noncoding regions, fewer sequences
are necessary to differentiate individuals. The noncoding
sequences of Table 9, for example, could comfortably provide
positive individual identification with a panel of perhaps
100 to 1,000 primers which each yield a noncoding amplified
sequence of 100 bp. If predicted coding sequences, such as
those from Table 6, are used, a more appropriate number of
primers for positive individual identification would be 500-
2,000.
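
A rough sense of how many variable positions such a panel samples can be obtained from the figures above. The sketch applies the estimate of one allelic variation per 500 bases uniformly to all amplified sequence, which is a simplifying assumption (the text notes that noncoding regions vary more than coding regions); the function name is hypothetical.

```python
def expected_variant_sites(n_primers, amplicon_bp=100, bases_per_variant=500):
    """Expected number of variable positions covered by a primer panel,
    assuming variation is spread evenly over the amplified sequence."""
    return n_primers * amplicon_bp / bases_per_variant


print(expected_variant_sites(100))    # 20.0
print(expected_variant_sites(1000))   # 200.0
```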

If a panel of reagents from ESTs or complete sequences
of this invention is used to generate a unique ID database
for an individual, those same reagents can later be used to
identify tissue from that individual. Positive
identification of that individual, living or dead, can be made
from extremely small tissue samples.

Another use for DNA-based identification techniques is
in forensic biology. PCR technology can be used to amplify
DNA sequences taken from very small biological samples such
as tissues, e.g., hair or skin, or body fluids, e.g., blood,
saliva, semen, etc. In one prior art technique, gene
sequences are amplified at specific loci known to contain a
large number of allelic variations, for example the DQα class
II HLA gene (Erlich, H., PCR Technology, Freeman and Co.
(1992)). Once this specific area of the genome is amplified,
it is digested with one or more restriction enzymes to yield
an identifying set of bands on a Southern blot probed with
DNA corresponding to the DQα class II HLA gene.

The sequences of the present invention can be used to
provide polynucleotide reagents specifically targeted to
additional loci in the human genome, and can enhance the
reliability of DNA-based forensic identifications. Those
sequences targeted to noncoding regions (see, e.g., Tables 8
and 9) are particularly appropriate. As mentioned above,
actual base sequence information can be used for
identification as an accurate alternative to patterns formed
by restriction enzyme generated fragments. Reagents for
obtaining such sequence information are within the scope of
the present invention. Such reagents can comprise complete
ESTs or corresponding coding regions, or fragments of either
of at least 15 bp, preferably at least 18 bp.

There is also a need for reagents capable of identifying
the source of a particular tissue. Such need arises, for
example, in forensics when presented with tissue of unknown
origin. Appropriate reagents can comprise, for example, DNA
probes or primers specific to particular tissue prepared from
the ESTs or complete sequences of the present invention.
Panels of such reagents can identify tissue by species and/or
by organ type. In a similar fashion, these reagents can be
used to screen tissue culture for contamination.

V. Production of Polypeptide Corresponding to ESTs 

As previously explained, each EST corresponds not only
to a coding region, but also to a polypeptide. Once the
coding sequence is known, or the gene is cloned which encodes
the polypeptide, conventional techniques in molecular biology
can be used to obtain the polypeptide.

At the simplest level, the amino acid sequence encoded
by the polynucleotide sequence can be synthesized using
commercially available peptide synthesizers. This is
particularly useful in producing small peptides and fragments
of larger polypeptides. (Fragments are useful, for example,
in generating antibodies against the native polypeptide.)

Alternatively, the DNA encoding the desired polypeptide
can be inserted into a host organism and expressed. The
organism can be a bacterium, yeast, cell line, or
multicellular plant or animal. The literature is replete
with examples of suitable host organisms and expression
techniques. For example, naked polynucleotide (DNA or mRNA)
can be injected directly into muscle tissue of mammals, where
it is expressed. This methodology can be used to deliver the
polypeptide to the animal, or to generate an immune response
against a foreign polypeptide (Wolff, et al., Science
247:1465 (1990); Felgner, et al., Nature 349:351 (1991)).
Alternatively, the coding sequence, together with appropriate
regulatory regions (i.e., a construct), can be inserted into
a vector, which is then used to transfect a cell. The cell
(which may or may not be part of a larger organism) then
expresses the polypeptide. (See Example 25.)

Antibodies generated against the polypeptide
corresponding to a sequence of the present invention can be
obtained by direct injection of the naked polypeptide into an
animal (as above) or by administering the polypeptide to an
animal, preferably a nonhuman. The antibody so obtained will
then bind the polypeptide itself. In this manner, even a
sequence encoding only a fragment of the polypeptide can be
used to generate antibodies binding the whole native
polypeptide. Such antibodies can then be used to isolate the
polypeptide from tissue expressing that polypeptide.
Moreover, a panel of such antibodies, specific to a large
number of polypeptides, can be used to identify and
differentiate such tissue.

VI. Examples 

Certain aspects of the present invention are described
in greater detail in the non-limiting Examples that follow.




EXAMPLE 1 

cDNA Sequences Determined by Random
Clone Selection: First set

METHODOLOGY:

With reference to the data presented in Table 1, lambda
ZAP libraries were converted en masse to pBluescript
plasmids, transfected into E. coli XL1-Blue cells, and plated
on X-gal/IPTG/ampicillin plates. A total of 1058 clones were
picked at random from three human brain cDNA libraries:
fetal brain, two-year-old hippocampus, and two-year-old
temporal cortex (Stratagene catalog #936206, 936205, 935,
respectively; Stratagene, 11099 N. Torrey Pines Rd., La
Jolla, CA 92037). An analysis of these clones is summarized
in Table 1 (see below). In addition, clones selected from the
hippocampus library were also analyzed after subtractive
hybridization with the fibroblast library. These results are
listed in the "Hippocampus Subtracted" column of Table 1.
Templates for DNA sequencing were PCR products or plasmids
prepared by the alkaline lysis method. About half of the
templates prepared by PCR failed to yield an amplified
fragment suitable for sequencing. This was primarily due to
use of PCR conditions that minimized the need for further
purification of the product but also selected against
amplification of long inserts (5 µl fresh or frozen overnight
culture of E. coli carrying the pBluescript plasmid, 7.5 µM
each dNTP, and 0.1 µM each primer for 35 cycles: 94°C,
40 sec; 55°C, 40 sec; 72°C, 90 sec). A further percentage of
the PCR-generated templates failed to sequence, largely due
to primer-dimer or other amplification artifacts. Qiagen™
columns improved the percentage of plasmid templates,
increasing the yields of usable sequence from about 60% with
a standard alkaline lysis protocol to over 90%. Overall, 117
PCR-generated templates and 497 plasmid templates resulted in
usable sequence. Dideoxy chain termination sequencing
reactions were performed with fluorescent dye-labeled M13
universal or reverse primers. After a cycle sequencing
protocol, carried out in a Perkin-Elmer thermal cycler,
sequencing reactions were run on an Applied Biosystems, Inc.
(Foster City, CA) 373A automated DNA sequencer. (Cycle
sequencing was performed in a Perkin-Elmer Thermal Cycler for
15 cycles of 95°C, 30 sec; 60°C, 1 sec; 70°C, 60 sec and
15 cycles of 95°C, 30 sec; 70°C, 60 sec with the Applied
Biosystems, Inc. Taq Dye Primer Cycle Sequencing Core Kit
protocol.) Some sequencing reactions were performed on an
ABI robotic workstation (Cathcart, Nature 347: 310 (1990),
hereby incorporated by reference).

RESULTS:

Single-run DNA sequence data were obtained from 609
randomly chosen cDNA clones. The number of clones sequenced
from each library is summarized in Table 1. Double-stranded
cDNA clones in the pBluescript vector were sequenced by a
cycle sequencing protocol with dye-labeled primers and
Applied Biosystems, Inc. 373A DNA Sequencers. The average
length of usable sequence was 397 bases with a standard
deviation of 99 bases.

Subtractive hybridization has been used successfully to
reduce the population of highly represented sequences in a
cDNA library by selectively removing sequences shared by
another library. (Schmid and Girou, Neurochem. 48: 307
(1987); Fargnoli et al, Anal. Biochem. 187: 364 (1990);
Duguid and Dinauer, Nucl. Acids. Res. 18: 2789 (1990);
Schweinfest, et al, Genet. Anal. Techn. Appl. 7: 64 (1990);
Travis and Sutcliffe, Proc. Natl. Acad. Sci. USA 85: 1696
(1988); Kato, Eur. J. Neurosci. 2: 704 (1990)). Subtractive
hybridization was therefore tested as a way of enhancing the
number of brain-specific clones in the hippocampus library by
hybridizing the hippocampus library with a WI38 human lung
fibroblast cell line cDNA library and removing the common
sequences (Schweinfest et al, Genet. Anal. Techn. Appl. 7: 64
(1990); Sive and St. John, Nucl. Acids Res. 16: 10937
(1988)). Clones from this subtraction are listed in the
column "Hippocampus Subtracted" in Table 1.

The EST sequences from this Example 1 are identified as 
SEQ ID NOs 1-315. 



[Table 1: Number of cDNA clones sequenced from each library, classified into eight groups by database match and including the "Hippocampus Subtracted" column. The tabulated values are illegible in the source.]

EXAMPLE 2 

Sequencing of Additional ESTs: Second Set

Over 2600 additional cDNA clones have been isolated, 
partially sequenced and screened. The clones were isolated 
from four human brain cDNA libraries. The new sequences thus 
discovered, together with the 315 brain ESTs from Example 1, 
correspond to over 2400 new human genes. These data 
represent an approximate doubling of the number of human
genes identified by DNA sequencing. 

Specifically, four cDNA libraries were used as sources
of clones for sequencing. Human hippocampus and fetal brain
libraries, plasmid template preparation, sequencing
reactions, and automated sequencing were performed as
described (Adams, M.D., Kelley, J.M., Gocayne, J.D., Dubnick,
M., Polymeropoulos, M.H., Xiao, H., Merril, C.R., Wu, A.,
Olde, B., Moreno, R.F., Kerlavage, A.R., McCombie, W.R., &
Venter, J.C. Science 252: 1651-56 (1991)). A pooled probe
consisting of inserts from 10 different EST clones with
sequences that matched either mitochondrial genes or the 18S
or 28S ribosomal RNAs was used to prescreen a gridded filter
array of the hippocampus library; nonhybridizing clones are
referred to as the "prescreened library". Another fetal
brain library was constructed by and was a gift from Bento
Soares (Columbia University). A directionally-cloned library
was prepared using the method of Rubenstein et al.
(Rubenstein, J., Elizabeth, A., Brice, A., Ciaranello, R.,
Denney, D., Porteus, M. & Usdin, T. Nucl. Acids Res. 18:
4833-4842) using human adult brain mRNA purchased from
Clontech (Palo Alto, CA; Catalogue # 6516-i). Of 482 clones
analyzed by restriction enzyme digestion, 33% contained
inserts at least 1500 base pairs in length. Stratagene
hippocampus and fetal brain library totals include data from
Adams et al., Science 252: 1651.

Sequences of nuclear-encoded cDNAs that did not include
interspersed repeats (Schmid, C. W. & Jelinek, W. R. Science
216: 1065-1070 (1982); Paulson, K. E., Deka, N., Schmid, C.
W., Misra, R., Schindler, C. W., Rush, M. G., Kadyk, L., &
Leinwand, L. Nature 316: 359-361 (1985); Fanning, T. G. &
Singer, M. F. Biochim. Biophys. Acta 910: 203-212 (1987))
were searched against all of GenBank and, in 6-frame
translation, against a comprehensive, non-redundant peptide
database using the network BLAST (Altschul, S. F., Gish, W.,
Miller, W., Myers, E.W., & Lipman, D. J. Mol. Biol. 215:
403-410 (1990)) server at the National Center for
Biotechnology Information. BLAST output was parsed, and an
interactive alignment editor was used to select which
matches, if any, from each search to record in a relational
EST database, which was developed to track sequencing,
identification, tissue localization, physical mapping, and
the public distribution of the clones, mapping and sequence
data. For significant similarities, a putative gene name and
Protein Identification Resource (PIR) gene family
identification (Barker, W., George, D., Hunt, L., &
Garavelli, J. Nucl. Acids Res. 19 (Suppl): 2231-2236 (1991))
for the EST were assigned. ESTs without significant matches
using BLAST were searched in translation against PIR using
FASTA. Ten additional marginal matches were found. A total
of 2300 new EST sequences comprising 765,505 nucleotides from
the current data set have been submitted to GenBank and
assigned accession numbers M77851-M79278 and M85308-M86179.
All ESTs except those multiply representing actin, tubulin,
and myelin basic protein clones were submitted. ATCC
accession numbers of cDNA clones from which ESTs were derived
are 77501-78999 and 81000-81756. The Genome Data Base
expressed D-segment numbers for these clones are
D0S1E-D0S2422E. The ESTs from this Example are identified herein
as SEQ ID NOs 316-2407.
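
By way of illustration only, the GenBank submission figures stated above can be cross-checked with a short calculation. The sketch below assumes that each accession range is inclusive and uses a one-letter prefix followed by digits; the function name is hypothetical and is not part of any database software referred to herein.

```python
def accession_range_size(first, last):
    """Count the accessions in an inclusive range such as M77851-M79278."""
    # Assumption: one-letter prefix followed by a decimal serial number.
    if first[0] != last[0]:
        raise ValueError("accession prefixes differ")
    return int(last[1:]) - int(first[1:]) + 1


# The two GenBank accession ranges and the nucleotide total given in the text.
GENBANK_RANGES = [("M77851", "M79278"), ("M85308", "M86179")]
TOTAL_NUCLEOTIDES = 765_505

total_ests = sum(accession_range_size(a, b) for a, b in GENBANK_RANGES)
mean_length = TOTAL_NUCLEOTIDES / total_ests

print(total_ests)             # number of accessions in both ranges
print(round(mean_length, 1))  # average submitted EST length in nucleotides
```

Under these assumptions the two ranges contain 1428 and 872 accession numbers, which together equal the 2300 submitted ESTs, for an average submitted length of about 333 nucleotides.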

EXAMPLE 3
EST Characterization: First Set 

ESTs including SEQ ID NOs 1-315 were analyzed as
follows. Initially, the EST sequences were examined for
similarities in the GenBank nucleic acid database (GenBank
Release 65.0), Protein Information Resource Release 26.0
(PIR), and ProSite (Release 5.0; MacPattern from the EMBL data
library, Fuchs R. Comput. Appl. Biosci. 7: 105 (1990), was
used). BLAST was used to search GenBank and the PIR (both
maintained by the National Center for Biotechnology
Information). ESTs without exact GenBank matches were
translated in all six reading frames and each translation was
compared with the protein sequence database PIR and the
ProSite protein motif database. Comparisons with the ProSite
motif database were done by means of the program MacPattern
from the EMBL Data Library. GenBank and PIR searches were
conducted with the "basic local alignment search tool"
programs for nucleotide (BLASTN) and peptide (BLASTX)
comparisons (Altschul et al., J. Mol. Biol. 215: 403 (1990)).
PIR searches were run on the National Center for
Biotechnology Information BLAST network service. The BLAST
programs contain a very rapid database-searching algorithm
that searches for local areas of similarity between two
sequences and then extends the alignments on the basis of
defined match and mismatch criteria. The algorithm does not
consider the potential of gaps to improve the alignment, thus
sacrificing some sensitivity for a 60-80 fold increase in
speed over other database-searching programs such as FASTA
(Pearson and Lipman, Proc. Natl. Acad. Sci. USA 85: 2444
(1988)).
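
By way of illustration only, the translation of an EST in all six reading frames, as described above, may be sketched as follows. The sketch assumes the standard genetic code and an input containing only the bases A, C, G and T; the function names are hypothetical, and the code is not the software actually used for the searches reported herein.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    seq = seq.upper()
    # Codons containing an unexpected character are reported as 'X'.
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Translate three frames of the given strand and three of its complement."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA"))
```

Each of the six resulting peptide strings can then be compared against a protein database, which is the role played by the translated searches described above.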

Sequence similarities identified by the BLAST programs
were considered statistically significant with a Poisson
P-value less than 0.01. The Poisson P-value is the
probability of as high a score occurring by chance given the
number of residues in the query sequence and the database.
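
By way of illustration only, the significance criterion stated above may be sketched numerically. The sketch assumes the conventional Karlin-Altschul form, in which the expected number of chance matches is E = K·m·n·exp(-λS) and the Poisson P-value is 1 - exp(-E). The parameters `lam` and `k`, and the function names, are assumptions introduced for this example and are not values taken from the present specification.

```python
import math


def expected_hits(score, query_len, db_len, lam, k):
    """Expected number of chance matches scoring at least `score`.

    `lam` and `k` are scoring-system constants; the values used by any
    particular BLAST run are not stated in the text and must be supplied.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def poisson_p_value(expected):
    """Probability of at least one chance match, given the expected count."""
    # expm1 keeps precision when the expected count is very small.
    return -math.expm1(-expected)


def is_significant(p_value, threshold=0.01):
    """Apply the P < 0.01 criterion stated in the text."""
    return p_value < threshold


print(poisson_p_value(0.005))                  # small E gives P close to E
print(is_significant(poisson_p_value(0.005)))
```

For small expected counts the P-value is nearly equal to the expected count itself, so the 0.01 threshold corresponds to roughly one chance match per hundred searches of the same size.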

After the BLASTN search, 30 unmatched ESTs were compared
against GenBank by FASTA to determine if significant matches
were missed due to the use of BLASTN for the database search.
No additional statistically significant matches were found.
Statistical significance does not necessarily mean functional
similarity; some of the reported matches may indicate the
presence of a conserved domain or motif or simply a common
protein structure pattern. Those ESTs identified as fully
corresponding to known human genes or proteins are not
included in this disclosure. Statistically significant
matches are reported in Table 2, together with the length and
percent identity or similarity of each alignment.

On the basis of database searches, 609 EST sequences
were classified into eight groups as shown in Table 1 (see
Example 1 above). Four groups, with 197 or 32% of the
sequences, consist of matches to human sequences: repetitive
elements, mitochondrial genes, ribosomal RNA genes, and other
nuclear genes. Forty-eight (8%) of the sequences matched
non-human entries in GenBank or PIR while 230 (38%) had no
significant matches. The remaining 134 (22%) sequences
contained no insert or consisted entirely of polyA between
the EcoRI cloning sites.
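
By way of illustration only, the classification counts given above can be checked for internal consistency. The sketch below uses only the counts stated in the preceding paragraph; the group labels are shorthand introduced for this example.

```python
# Counts of EST sequences per class, as stated in the text.
groups = {
    "matches to human sequences": 197,
    "matches to non-human entries": 48,
    "no significant match": 230,
    "no insert or polyA only": 134,
}

total = sum(groups.values())
percentages = {name: round(100 * n / total) for name, n in groups.items()}

print(total)
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

The four counts sum to the 609 sequences analyzed, and the rounded percentages (32%, 8%, 38% and 22%) agree with those stated.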

Thirty-six ESTs matched previously sequenced human
nuclear genes with more than 97% identity. Four of these
ESTs are from genes encoding enzymes involved in maintaining
metabolic energy, including ADP/ATP translocase, aldolase C,
hexokinase, and phosphoglycerate kinase. Human homologs of
genes for the bovine mitochondrial ATP synthase Fo β-subunit
and porcine aconitase were also found (Table 2).
Brain-specific cDNAs included synaptophysin, glial fibrillary
acidic protein (GFAP), and neurofilament light chain. At
least six ESTs are from genes encoding proteins involved in
signal transduction: 2',3'-cyclic nucleotide
3'-phosphodiesterase (2 ESTs), calmodulin, c-erbA-α-2, Gsα, and
Na+/K+ ATPase α-subunit. Other ESTs were matches to genes
for the ubiquitous structural proteins actins, tubulins, and
fodrin (non-erythroid spectrin). ESTs also document the
presence in the hippocampus cDNA library of the ret
proto-oncogene, the ras-related gene rhoB, and one of the
chromosome 22 breakpoint cluster region transcripts. Eight
ESTs are from genes known to be associated with genetic
disorders (Online Mendelian Inheritance in Man). More than
half of the human-matched ESTs from Example 1 have been
mapped to chromosomes, indicating the bias of GenBank entries
toward well-studied genes and proteins.

ESTs without significant GenBank matches were also
compared to the ProSite database of recognized protein
motifs. Not counting post-translational-modification
signatures, fifty-four sequences contained motifs from the
database. Some patterns, particularly the "leucine zipper",
are found in scores or hundreds of proteins that do not share
the functional property implied by the presence of the motif.

Similarities to sequences from other organisms were also
detected in the BLAST searches of GenBank and PIR (Table 2).
Several ESTs displayed similarity to "housekeeping" genes,
including the ribosomal proteins S10 and L30 (rat) and the
above glycolytic enzymes. EST00257 (SEQ ID NO: 77) shows
strong nucleotide sequence similarity to the squid (67%) and
Drosophila (70.4%) kinesin heavy chain. Kinesin was first
described as a microtubule-associated motor protein involved
in organelle transport in the squid giant axon (Vale et al.,
Cell 42: 39 (1985)). Six oncogene-related sequences were
also among the cDNA clones sequenced. EST00299 (SEQ ID
NO: 180) and EST00283 (SEQ ID NO: 271) show similarity to
several ras-related genes and EST00248 (SEQ ID NO: 102)
matched the 3' untranslated region of the bovine substrate of
botulinum toxin ADP-ribosyltransferase. Similarities with an
S. cerevisiae RNA polymerase subunit and Torpedo electromotor
neuron-associated protein were also observed. Two ESTs may
represent new members of known human gene families: EST00270
matched the three β-tubulin genes with 88-91% identity and
EST00271 (SEQ ID NO: 248) matched α-actinin with 85% identity
at the nucleotide level.

Among the most interesting of the primary sequence
relationships was the similarity of ESTs to the Drosophila
genes Notch and Enhancer of split. Nucleotide and peptide
alignments of EST00256 (SEQ ID NO: 188) and EST00259 (SEQ ID
NO: 227) with the Drosophila genes have been demonstrated.
Both genes are part of a signal cascade encoded by the
"neurogenic" genes that are involved in the differentiation
of neuronal and epidermal cell lineages in the neuroectoderm
of the developing Drosophila embryo (Campos-Ortega, Trends in
Neuro. Sci. 11: 400 (1988)). It has been proposed that the
Enhancer of split protein interacts with a membrane protein
that is the product of the Notch gene to convert a
developmental signal into an altered pattern of gene
expression (id. J. Mol. Biol. 215: 403 (1990)). EST00256
(SEQ ID NO: 188) matches near the 5' end of the Enhancer of
split coding sequence, away from the mammalian G protein β
subunit- and yeast cdc4-like elements (Hartley et al., Cell
55: 785 (1988); Klambt et al., EMBO J. 8: 203 (1989)). Part
of the EST00259 (SEQ ID NO: 227) match to Notch is in the
cdc10/SWI6 region that is similar to three cell-cycle control
genes in yeast and is tightly conserved in the Xenopus Notch
homolog, Xotch. In Drosophila, Enhancer of split is
absolutely required for formation of epidermal tissue. Notch
contains several epidermal growth factor-like repeats and
appears to play a general role in cell-cell communication
during development (Banerjee and Zipursky, Neuron 4: 177
(1990)).

Seven genes were represented by more than one EST.
Comparisons of all the ESTs against one another revealed two
overlaps of unknown ESTs: EST00233 (SEQ ID NO: 32) and
EST00234 (SEQ ID NO: 8) match in opposite orientations and
EST00235 (SEQ ID NO: 204) and EST00236 (SEQ ID NO: 148) match
in the same orientation beginning at the same nucleotide.
Five human genes were represented by more than one EST:
β-actin (3), γ-actin (2), α-tubulin (2), α-2-macroglobulin (2),
and 2',3'-cyclic-nucleotide-3'-phosphodiesterase (2). Those
few instances where two or more ESTs represent different
portions of a single cDNA can be readily ascertained when the
sequence of the full cDNA insert is determined in accordance
with Example 13.

EXAMPLE 4

EST Sequence Characterization: Second Set

The ESTs of Example 2, including SEQ ID NOs 316-2407,
were screened against known sequences listed in GenBank and
other databases, as in Example 3. The results are reported
in Table 2. The quality of the match is given as percent
identity and length in base pairs for nucleotide matches and
amino acid residues for peptide matches. In many cases ESTs
match multiple domains on several related proteins; for
example, EST00825 matches two transmembrane domains on both
GABA and norepinephrine transporters. Nucleotide databases
are: GenBank (GB) and EMBL (E); peptide databases are:
GenPept (GPU), Swiss-Prot (SP), and PIR.
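
By way of illustration only, the two match-quality figures reported in Table 2 (alignment length and percent identity) may be computed from an ungapped alignment as sketched below. The function name and the example sequences are hypothetical and do not correspond to any EST disclosed herein.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (length, percent identity) for two equal-length aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(x == y for x, y in zip(aligned_a.upper(), aligned_b.upper()))
    return len(aligned_a), 100.0 * matches / len(aligned_a)


# Hypothetical 8-base alignment differing at a single position.
length, identity = percent_identity("ACGTACGT", "ACGTTCGT")
print(length, identity)
```

For nucleotide matches the length is counted in base pairs, and for peptide matches in amino acid residues, as stated above.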

The great majority (83%) of the partial cDNA sequences
reported in Example 2 are unrelated to any sequences
previously described in the literature. Based on database
matches to known genes from humans as well as from such
evolutionarily distant organisms as E. coli, yeast, C.
elegans, Drosophila, barley, Arabidopsis, rice, and green
algae, we have preliminarily identified the functional type
of a number of the ESTs (Table 2). These include a novel
gene similar to Notch/Tan-1 (Adams et al., supra), a new
neurotransmitter transporter gene, and a new member of the
multi-drug resistance gene family. Several genes involved in
development or cell differentiation in Drosophila are
represented by similar human ESTs, including seven in
absentia (Carthew, R. & Rubin, G. Cell 63: 561-577 (1990)),
big-brain (bib) (Rao, Y., Jan, L., & Jan, Y. Nature 345:
163-167 (1990)), the discs-large tumor suppressor (Woods, D. &
Bryant, P. Cell 66: 1-20 (1991)), and the homeotic gene
orthodenticle (Finkelstein, R., Smouse, D., Capaci, T.,
Spradling, A. & Perrimon, N. Genes Dev. 4: 1516-1527
(1990)). New members of gene families previously known in
humans include a Ca2+-transporting ATPase, an ADP
ribosylation factor, and a new neural-cell adhesion molecule
gene.

The 1971 ESTs without a putative identification were 
analyzed using the coding-region prediction program CRM via 
the GRAIL server (Uberbacher, E. & Mural, R. Proc. Natl. 
Acad. Sci. USA 88: 11261-5 (1991)). Fifteen percent of the 
unknown ESTs scored an excellent probability of containing 
protein-coding sequence. Fifty percent of the ESTs matching known
human genes contain protein-coding sequences; therefore, at
most half of the unknown ESTs are likely to contain coding
sequences. We have found no evidence that genomic DNA or
cDNA to unspliced precursor RNA is a major contaminant of 
either the hippocampus or fetal brain library. 

Table 2: ESTs Identified by Database Matches

SEQ ID  EST#      Putative Identification                             Accession    DB   Len  %ID
208     EST00250  60K filarial antigen                                A28209       PIR  108  56.9
2320    EST01784  60K filarial antigen                                A26209       PIR  88   50.6
969     EST01982  ADP-ribosylation factor 1                           B33283       PIR  84   41.2
1834    EST01620  AMP deaminase, brain                                A37056       PIR  57   100.0
97      EST00289  Aconitase                                           A35544       PIR  105  90.6
251     EST00370  Actin, other                                        S10021       PIR  44   51.1
248     EST00271  Actinin, alpha                                      HUMACTAR     GB   271  85.3
891     EST01891  Actinin, alpha                                      HUMACTAR     GB   315  81.6
1500    EST02538  Actinin, alpha                                      HUMACTAR     GB   271  75.0
132     EST00110  Agrin                                               RATAGR       GB   269  82.2
1852    EST01625  Agrin                                               RATAGR       GB   103  84.6
1094    EST02113  Ala                                                 HUMALA       GB   92   82.8
691     EST00675  Alcohol dehydrogenase                               RICG0S2GJ    GPU  38   59.0
2408    EST00244  Amyloid A4                                          HUMAFPA4     GB   135  91.9
1965    EST01664  Amyloid A4                                          A29030       PIR  52   54.7
2068    EST01694  Amyloid A4                                          QRHUA4       PIR  63   69.0
2092    EST01700  Anion exchanger homolog AE3                         A33638       PIR  95   97.9
1880    EST01634  Axonal glycoprotein TAC-1                           A34695       PIR  69   87.1
1492    EST02530  B cell-specific Mo-MLV integration site 1 (bmi-1)   MUSBM11A     GB   111  87.5
1277    EST02306  Bib protein                                         S09699       PIR  57   53.4
13      EST00255  Cadherins                                           CADN$HUMAN   SP   41   45.2
1348    EST02378  cAMP-dependent protein kinase inhibitor             MUSPKI       GB   234  91.5
1931    EST01041  cAMP-regulated phosphoprotein                       B35308       PIR  21   86.4
1413    EST02447  cAMP-specific phosphodiesterase                     HUMPOEAA     GB   363  69.0
396     EST01443  CDPdiacylglycerol-serine O-phosphatidyltransferase  JH0368       PIR  33   41.2
1956    EST01663  Ca2+-transporting ATPase 2                          B28065       PIR  125  88.9
1126    EST02146  Calbindin D28                                       RATCALBD28   GB   81   87.8
1039    EST02055  Calcium channel                                     S050E4       PIR  33   67.6
1910    EST01645  Calmodulin                                          RATRCM1      GB   120  S0.1
485     EST01466  Calmodulin-dependent protein kinase, type II, beta  A26464       PIR  93   98.9
913     EST01913  Clathrin coat assembly protein AP50 homolog         YSCYAP54J    GPU  62   63.5
2004    EST01676  Cofilin                                             PIGCOFIL     GB   132  89.5
2400    EST01824  Cysteine-rich intestinal protein                    GYRTI        PIR  56   66.7
1588    EST02633  02223 repetitive DNA                                HUMREP       GB   160  76.4
2192    EST01257  Diacylglycerol kinase, lymphocyte                   S09156       PIR  44   42.2
1441    EST02477  Diamine acetyltransferase                           ATDA$HUMAN   SP   74   45.3
650     EST00642  Dilute (myosin heavy chain)                         MUSDILUTE_1  GPU  27   100.0
2302    EST01779  Discs-large tumor suppressor                        DRODLGAJ     GPU  53   63.0
188     EST00256  Enhancer of split                                   A30047       PIR  86   58.6
2289    EST01325  Fatty acid synthase                                 RATFAS       GB   98   79.8
310     EST00377  Fo ATPase beta subunit, mitochondrial               BOVMTAS8     GB   293  85.4
1332    EST02362  GA binding protein, beta subunit                    MUSGACJ      GPU  86   90.8
1667    EST00825  Gamma-aminobutyric acid transporter                 A35918       PIR  26   59.3
2217    EST01738  Gelation factor ABP-280                             A37098       PIR  74   80.0
1412    EST02446  Glutamate-aspartate carrier protein                 JV0092       PIR  57   37.9
1020    EST02034  Glutaminase                                         GLS$RAT      SP   34   74.3
1885    EST01639  Histocompatibility antigen modifier 1               A37779       PIR  63   75.0
1495    EST02533  Hypothetical 43.5K protein                          JU0319       PIR  43   52.3
2326    EST01791  Inositol-1,4,5-trisphosphate 3-kinase               JN0129       PIR  65   68.2
724     EST01529  Interferon-induced 54K protein                      !Nt4d HUMAN  SP   76   70.1
1035    EST02051  J1 protein                                          MUSJ1PRO     GB   362  85.7
1229    EST02268  KUP protein                                         HUMKUPMRJ    GPU  54   36.4
993     EST02007  Kinase 5 protein                                    CHKCEK5_1    GPU  68   94.2
77      EST00257  Kinesin                                             A35075       PIR  57   86.2
78      EST00258  Kinesin                                             A35076       PIR  62   47.6
2246    EST01748  Kinesin                                             A35075       PIR  98   52.5
2282    EST01764  Lamin B receptor                                    A36427       PIR  76   71.4
2173    EST01724  Lon protease                                        JQ0901       PIR  103  41.3
1427    EST02463  Long-chain-fatty-acid-CoA ligase                    A36275       PIR  36   62.2
313     EST00276  Lysosomal membrane glycoprotein 1 (LAMP-1)          A31959       PIR  53   48.3
161     EST00247  MARCKS (myristoylated alanine-rich protein kinase   BOVMARCKS    GB   139  83.6
1386    EST02418  MARCKS homolog                                      MMF52        EU   237  92.4
769     EST00734  MARCKS homolog                                      S08341       PIR  61   40.3
43      EST00371  Maternal G10 protein                                S05955       PIR  38   92.3
1468    EST02505  Matrin 3                                            RATMATRIN3   GB   137  93.5
639     EST00832  Membrane transport superfamily (GTP-dependent)      A24400       PIR  63   39.1
1894    EST01643  Membrane transport superfamily (GTP-dependent)      A24400       PIR  71   50.0
824     EST01865  Microtubule-associated protein 1B                   RATNEU       GB   293  86.4
223     EST00368  Microtubule-associated protein 1B                   A33645       PIR  30   54.8
2032    EST01683  Microtubule-associated protein 1B                   A33845       PIR  49   62.0
2017    EST01678  Milk fat globule membrane protein                   A36479       PIR  48   61.2
1704    EST01580  Myeloid differentiation primary response gene My01  MUSMY0118_1  GPU  76   88.3
2226    EST01744  NAD(P)+ transhydrogenase (B-specific)               OEBOXM       PIR  86   93.1
1667    EST02610  Neural cell adhesion molecule L1                    S05479       PIR  82   43.4
606     EST01471  Neuraxin                                            S06017       PIR  120  84.3
1566    EST02609  Neutrophil oxidase factor                           A34856       PIR  43   47.7
962     EST01961  Notch/Xotch                                         HUMTAN1_1    GPU  85   57.0
227     EST00259  Notch/Xotch                                         A35844       PIR  74   85.3
1396    EST02429  Nuclear factor 1-like protein (NF1)                 HAMNF1A      GB   111  92.0
1681    EST01573  Nucleoside diphosphate kinase                       A33366       PIR  71   52.8
346     EST01828  Otd homeotic protein                                A38912       PIR  35   52.6
2254    EST01751  Phosphatidylinositol-4,5-bisphosphate phosphodiest  A28807       PIR  40   90.2
1869    EST00992  Polymyxin B resistance                              A32714       PIR  20   76.2
93      EST00287  Processing enhancing protein                        S03968       PIR  96   68.8
2353    EST01806  Prohibitin                                          RATPROHIBJ   GPU  120  97.5
2297    EST01775  Prohormone cleavage enzyme                          MUSMPC1A_1   GPU  91   93.5
9       EST00376  Prolyl endopeptidase                                PIGPREP      GB   223  83.9
1069    EST02087  Protein kinase C, zeta                              HUMPKCL      GB   382  58.7
1933    EST01650  Protein phosphatase 2A beta subunit                 HUMPROP2AB   GB   288  76.8
202     EST00298  Protein-tyrosine phosphatase LRP                    LRP$MOUSE    SP   62   44.4
1654    EST01572  Protochlorophyllide reductase                       S04783       PIR  34   57.1
38      EST00374  RNA polymerase II 6th subunit (RP026)               A36352       PIR  72   75.3
1478    EST02515  Rab5                                                F34323       PIR  91   82.6
2368    EST01389  Radial spoke protein 3                              S05962       PIR  58   52.5
37      EST00038  ras p21-like small GTP-binding protein (smg GDS)    BOVSMGGDS    GB   131  89.4
180     EST00299  ras-related proteins                                S10493       PIR  51   46.1
1700    EST01579  Retrovirus-related gag polyprotein                  FOHUE2       PIR  95   77.1
1511    EST02550  Retrovirus-related pol polyprotein                  GNUGL        PIR  50   54.9
102     EST00248  rhoH12/ARH12                                        BOVBGBRH     GB   195  79.6
1715    EST01583  Ribosomal protein L18a                              R5RT18       PIR  68   95.7
1856    EST01627  Ribosomal protein L1a                               A24579       PIR  75   63.1
1974    EST01667  Ribosomal protein L3                                JQ0771       PIR  74   80.0
301     EST00300  Ribosomal protein L30                               R6RT30       PIR  57   96.5
22      EST00301  Ribosomal protein S10                               R3RT10       PIR  66   97.0
2402    EST01826  Ribosomal protein S10                               R3YM10       PIR  36   51.4
483     EST01459  Ribosomal protein YL10                              S11581       PIR  40   68.3
1408    EST02442  Seven in absentia                                   A36195       PIR  46   80.8
299     EST00249  smg p26A GDP dissociation inhibitor                 A35652       PIR  97   77.5
951     EST01960  Spectrin, beta                                      HUMSPTB      GB   268  67.7
2089    EST01699  Sperm membrane protein                              A35981       PIR  52   58.5
2073    EST01697  Succinate dehydrogenase flavoprotein                BOVSDHFP1_1  GPU  44   100.0
2138    EST01715  Succinate dehydrogenase flavoprotein                BOVSDHFP1J   GPU  49   92.0
430     EST00472  Synaptotagmin (p65)                                 SY65$HUMAN   SP   27   63.6
1371    EST02402  Talin                                               MUSTALINRJ   GPU  79   61.2
1771    EST01601  Thiosulfate sulfurtransferase (rhodanese)           ROBO         PIR  65   81.8
300     EST00232  Transforming protein (dbl)                          TVHUDB       PIR  25   65.4
189     EST00282  trkB                                                A35104       PIR  33   67.6
653     EST01512  Tubulin, alpha                                      HUMTUBAG     GB   223  75.0
594     EST01490  Tubulin, beta                                       HUMTBB5      GB   298  93.6
757     EST01542  Tubulin, beta                                       HUMTUBBM     GB   217  90.4
1245    EST02274  Tubulin, beta                                       A26561       PIR  105  88.7
1147    EST021C9  Tyrosine kinase                                     HUMECK       GB   384  74.3
1701    EST00853  Unc-104                                             JN0114       NR   36   45.0
2121    EST01711  Valine-tRNA ligase                                  A29871       PIR  56   57.9
187     EST00152  Wilm's tumor-related protein                        HUMQM        GB   228  99.6
1726    EST01588  XPR2 alkaline extracellular protease                B26955       PIR  88   46.1
249     EST00275  Zinc finger proteins                                S06551       PIR  25   57.7
413     EST01446  Zinc finger proteins                                S00754       PIR  45   60.9
469     EST01460  Zinc finger proteins                                C32891       PIR  34   54.3
833     EST01560  Zinc finger proteins                                S00754       PIR  105  67.0
1230    EST02259  Zinc finger proteins                                S00754       PIR  71   62.5
1496    EST02534  Zinc finger proteins                                A34612       PIR  50   45.1
2324    EST01352  Zinc finger proteins                                S10397       PIR  29   56.7

There is little redundancy in EST sequencing according
to the present invention. Of the nuclear-encoded messenger
RNAs, the most common ESTs were to the β-actin (0.6% of the
EST clones) and myelin basic protein genes (MBP, 0.5% of
the clones). MBP, a highly expressed structural component
of nerve tissue (Kamholtz, J., de Ferra, F., Puckett, C., &
Lazzarini, R. Proc. Natl. Acad. Sci. USA 83: 4962-4966
(1986)), displays four alternate splicing forms, of which
at least two are present among the ESTs reported here.
Other common ESTs were Gs-alpha, gamma-actin, and both alpha-
and beta-tubulin.

By matching ESTs to known database sequences, a
phenotypic characterization of the tissue begins to emerge.
Protein superfamilies matched by ESTs were grouped into
three broad functional categories to assess the biological
spectrum represented by these randomly selected cDNA
clones. Structural and metabolic classes comprised about
30% of the ESTs with database matches. Twenty-five percent
were involved in regulatory pathways and the remainder were
not classifiable. Eleven of the eighteen enzymes of
glycolysis and the citric acid cycle are represented by at
least one subunit or isozyme. In addition, several genes
not previously known to be expressed in the brain were
matched, including spermine/spermidine acetyltransferase
(Casero, R., Celano, P., Ervin, S., Applegren, N., Wiest, L.
& Pegg, A. J. Biol. Chem. 266: 810-814 (1991)) and
osteopontin (Young, M., Kerr, J., Termine, J., Wewer, U.,
Wang, M., McBride, W. & Fisher, L. Genomics 7: 491-502
(1990)).

EXAMPLE 5 

Mapping of ESTs to Human Chromosomes

Randomly selected ESTs, corresponding to the SEQ ID NOs
listed in Table 3, were assigned to chromosomes via PCR.
Oligonucleotide primer pairs were designed from EST
sequences to minimize the chance of amplifying through an
intron. The oligonucleotides were 18-23 bp in length and
designed for PCR amplification using the computer program
INTRON (National Institutes of Mental Health, Bethesda,
MD). The program is based on the assumptions that: 1)
introns are genomic sequences that interrupt the coding and
noncoding sequences of genes (Smith, J. Mol. Evol. 27: 45-55
(1988)); 2) there are consensus sequences for splice
junctions (Shapiro et al., Nucl. Acids Res. 15: 7155-7174
(1987)); and 3) 90% of the human genes studied have
3' untranslated regions of mRNA not interrupted by introns
in the genomic DNA (Hawkins, Nucl. Acids Res. 16: 9893-9908
(1988)).

The program evaluates the likelihood that a given GG
or CC dinucleotide represents a former exon-intron
boundary. Specifically, every input strand is processed by
the INTRON program twice, first evaluating the sense mRNA
strand, and then processing the complementary or anti-sense
strand. The program evaluates each sequence by finding all
GG or CC pairs (possible former splice sites), searching
for stop codons in all three reading frames, and analyzing
the GG or CC pairs surrounded by stop codons. All regions
of the EST that are unlikely to contain splice junctions
based on CC content, GG content, and stop codon frequency
are then marked by the program in uppercase.
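
By way of illustration only, the first two steps of the scan described above (locating GG or CC dinucleotides and locating stop codons in the three reading frames of each strand) may be sketched as follows. This is not the INTRON program itself: the function names are hypothetical, and the scoring by which INTRON weighs these features and marks regions in uppercase is not reproduced because it is not specified in the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the anti-sense strand read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def dinucleotide_positions(seq, pairs=("GG", "CC")):
    """0-based start positions of every GG or CC pair (overlaps included)."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] in pairs]


def stop_codon_positions(seq):
    """Map each reading frame (0, 1, 2) to the start positions of its stop codons."""
    seq = seq.upper()
    return {
        frame: [
            i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS
        ]
        for frame in range(3)
    }


def scan_both_strands(seq):
    """Collect both feature types for the sense strand and then the anti-sense strand."""
    result = {}
    for name, strand in (
        ("sense", seq.upper()),
        ("antisense", reverse_complement(seq)),
    ):
        result[name] = {
            "dinucleotides": dinucleotide_positions(strand),
            "stops": stop_codon_positions(strand),
        }
    return result


print(scan_both_strands("AGGTCCATGTAAC"))
```

A region rich in stop codons in every frame and poor in GG or CC pairs would, on the assumptions stated above, be a candidate 3' untranslated region from which to select primers.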

The creation of PCR primers from known sequences is
well known to those with skill in the art. For a review of
PCR technology see Erlich, H.A., PCR Technology: Principles
and Applications for DNA Amplification, 1992, W.H. Freeman
and Co., New York. ESTs were examined for the presence of
stop codons in each reading frame and for consensus splice
junctions. The presence of stop codons and absence of
splice junction sequences are more characteristic of 3'
untranslated sequences than of introns. The untranslated
sequences are unique to a given gene; thus, primers from
these regions are less likely to prime other members of a
gene family or pseudogenes.

The primers were used in polymerase chain reactions
(PCR) to amplify templates from total human genomic DNA.
PCR conditions were as follows: 60 ng of genomic DNA was
used as a template for PCR with 80 ng of each
oligonucleotide primer, 0.6 unit of Taq polymerase, and 1
µCi of a 32P-labeled deoxycytidine triphosphate. The PCR
was performed in a microplate thermocycler (Techne) under
the following conditions: 30 cycles of 94°C, 1.4 min; 55°C,
2 min; and 72°C, 2 min; with a final extension at 72°C for
10 min. The amplified products were analyzed on a 6%
polyacrylamide sequencing gel and visualized by
autoradiography. If the size of the resulting product was
equivalent to the EST from which the primers were derived,
then the PCR reaction was repeated with DNA templates from
two panels of human-rodent somatic cell hybrids: BIOS
PCRable DNA (BIOS Corporation) and NIGMS Human-Rodent
Somatic Cell Hybrid Mapping Panel Number 1 (NIGMS, Camden,
NJ).

PCR was used to screen a series of somatic cell hybrid 
cell lines containing defined sets of human chromosomes for 
the presence of a given EST. DNA was isolated from the 
somatic hybrids and used as starting templates for PCR 
reactions using the primer pairs from EST sequences 
selected above. Only those somatic cell hybrids with 
chromosomes containing the human gene corresponding to the 
EST will yield an amplified fragment. ESTs were assigned 
to a chromosome by analysis of the segregation pattern of 
PCR products from hybrid DNA templates. For a review of
techniques and analysis of results from somatic cell gene
mapping experiments, see Ledbetter et al., Genomics
6:475-481 (1990). The single human chromosome present in
all cell hybrids that give rise to an amplified fragment 
represents the chromosome containing that EST. 
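
By way of illustration only, the segregation analysis described above may be sketched as a concordance test: a chromosome is a candidate only if it is present in every hybrid that yields a PCR product and absent from every hybrid that does not. The panel composition and hybrid names below are hypothetical and do not describe the BIOS or NIGMS panels; the function name is likewise an assumption of this example.

```python
def assign_chromosome(pcr_positive, hybrid_chromosomes):
    """Return the chromosomes fully concordant with the PCR results.

    pcr_positive: dict mapping hybrid name -> True if a product was amplified.
    hybrid_chromosomes: dict mapping hybrid name -> set of human chromosomes
        retained by that hybrid cell line.
    """
    all_chromosomes = set().union(*hybrid_chromosomes.values())
    concordant = []
    for chrom in sorted(all_chromosomes):
        # Presence of the chromosome must match the PCR result in every hybrid.
        if all(
            (chrom in hybrid_chromosomes[h]) == pcr_positive[h]
            for h in hybrid_chromosomes
        ):
            concordant.append(chrom)
    return concordant


# Hypothetical four-hybrid panel and PCR results for one EST primer pair.
panel = {
    "hybrid_A": {"1", "2"},
    "hybrid_B": {"2", "3"},
    "hybrid_C": {"1", "3"},
    "hybrid_D": {"4"},
}
results = {"hybrid_A": True, "hybrid_B": True, "hybrid_C": False, "hybrid_D": False}

print(assign_chromosome(results, panel))
```

An unambiguous assignment corresponds to exactly one concordant chromosome; an empty or multiple result would indicate that the panel cannot resolve the EST or that a PCR result is discordant.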

The assignment of 100 ESTs and corresponding genes to
chromosomes by PCR is shown in Table 3.






Table 3: Assignment of ESTs to Chromosomes by PCR

SEQ ID  EST#      Chr  PRIMER #1                  PRIMER #2
5       EST00012  1    TCCAGGCAATCCCAGAATAG       CTAATTGAGCTCACTGGCCC
57      EST00058  1    CTGTTTGCAAGTTTCAAAGC       GCCATTTCTAACAACCAGAG
64      EST00066  1    GCCATTGTGCTGAATAGAGT       GTTAGTGTTTCCTTAGCAAG
83      EST00079  1    CAGCTAATTGACCTGGGCTA       CAACATGCTCTGAGCTTTAG
83      EST00079  1    GGCAGAGCATAATGAGTATA       CATATGCATATGGTCCCTAT
91      EST00086  1    AGTTTAGATGGAGGGCTGTC       TCTGCCCTAATGCGCAGGCT
105     EST00365  1    CTTAATCACCTCCCTTTTGT       CCTTAGTTGGAGATAAGGTC
109     EST00095  1    AGTCTAATCCTGTACACTTG       CGGGCTTTCTCTGAATTGGT
116     EST00100  1    TTAGAAGTGCCCATGGGAGG       TTTTAAGGCTCTGGAGTGTT
141     EST00118  1    CTCAGAGAAACTTAGGTGAA       CTACAGAATCATTTCACCAG
220     EST00372  1    AAGTTGCACATTGCCCAAGG       ATAGTACTGCAAGGTTATTC
237     EST00187  1    TTACAAATTTCTCTTGACGC       CTGAAGGAGCACAGTTTCTC
242     EST00192  1    GGATCAGATAATCAAACAGG       GCTTAGGATATGAATGCATA
259     EST00202  1    GCATCACAGTTTAACTOAGG       CTACATATTTGTGCCTCCTT
269     EST00293  1    CTGTTGCTGTGCAGTAGCTT       CTTTTGACCCAGTGAAACTT
299     EST00249  1    GATCATGCAGACGTAGATAT       CCAACTCCTGCCAGATCATT
1651    EST00810  1    TAGTCGCTGTAAGTTGATTC       GCTTTGCTGGATGCTTCATT
16      EST00021  2    CAGGCAAGTTTCTTCCAGGA       TCAGACCCATGGTCAGCTT
1898    EST01013  2    GGCTGAGAACGGTTAGCATA       CCCTCAGCTTAGGGGAATG
8       EST00234  2    TAGAAGGCAAACTATGTCCC       GGTTGAGGATTGGCTTTTAC
36      EST00037  2    AGCCAGAAGGCTGCTTAAAG       GCAGTGAACCAGTACTCCTA
123     EST00106  2    GTCTAATTTGTAACCTTCAG       GATAGATTGTATAAGAAGCC
192     EST00155  2    GATTTATGTCTGGGAACTAA       GCAGCATGTGAAAGAATGAT
200     EST00162  2    TTTAATGGGTGGTGGGAGCT       CGATGCACATCCTTCTCCAT
284     EST00216  2    CCTAAGAATTCGTTTGGCTC       GTCTGGCACATAATAGATTTG
102     EST00248  3    ATACTACATCTAGTCTGG         TTACAGTTCTGTGGTTTC
167     EST00138  3    AAACAGCTGCGGAGTACA         AAAGGATCCTCCACTCCAGA
12      EST00274  3    CCTAGCAAACTCATACACAC       CATAAGTGAATGGACACAGG
60      EST00062  3    ACACATTAACGGTGCTGCAG       GGAATCAGCCCTTGAGGACT
77      EST00257  3    AAGCTCACAACGCAGATCTG       CTGGAACAGCTTACAAAGGT
107     EST00093  3    ATTGAACTCTGTCAACAGTG       TGTAAAACAAAGGCCAAACT
108     EST00094  3    AL2-GCAGGATGTCAGTCTTTTGAG  AGCACACATTATCTACCACGGC
1706    EST00857  3    AL2-GCAGGATGTCAGTCTTTTGAG  CCAGCACACATTATCTACCACG
37      EST00038  4    AACTTCGCAGTCATGAGAAC       TGTATCGGGCAGTTCTCAG
6       EST00013  4    CACATGTTCTCCCTCTTTCA       GCATTTTGGAGCTCTTCCGT
37      EST00038  4    AL2-GGAAGTACAGGATTTGGC     TTAGAGATGGGATGATGCCG
31      EST00033  5    TGGGTACCCTAAGGTGTTTG       GACTAATCTAAGGTCTAGG
28      EST00030  5    AGATAAGTTAGGAAGCTGGT       ACTCACTGCTAGTATCATCC
59      EST00061  5    AAAGTTTCTTAGCACCCCCC       CAGACTTTGACAAAAGAATC
74      EST00073  5    ATCAGACACGTGGCAGGGTT       AAGTCCCTGAGGGTGCAGAA
121     EST00104  5    TGAAGGCAGCTGCTAAATCT       GGATGTATTGATCTGACTCA
149     EST00123  5    ATACTGTCAACGGAGGGTGA       GTCTGCAGGTTTCTCCTTGA
235     EST00185  5    TTACTGTCCCATCAGATATC       TACACTCTTAAGAAGGTATG
1643    EST00803  5    GAGCGTTTAAAAGAGATTCT       TACAGACAGCCATGTXCCAA
1677    EST00835  5    AL2-TCTCCAACACAGTCATGC     CGGATGCCATCATATACC
23      EST00026  5    CCTGCAGTGACACTTAACAT       CTGCTCACCTGAAATTGATAC
121     EST00104  5    AL2-CAGATCAATACATCCTCTGGG  CTGTGCAGTGGTGAGTAAAAGG
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EST# 


Chr 


PRIMER *1 


1 


T?i o m r\ a a a 

HST00007 


6 


TAGTTGATGGTCTGGGTTAT 


19 


EST00023 


6 


CAACTTACATTAGGGGTTTG 


ICC 

155 


wowaai aa 

EST00129 


6 


GGAAGCTGCCATATAAGCTC 


224 


EST00356 


6 


GCTGTATGTTAACCCTTTGT . 


.2 Bb 


EST00219 


6 


AC 1 1 i CATGTTG AGAAGTAT 


1638 


nnnt a a *w a a 

EST00798 


6 


CTTCATCTGTTAACTGTTGA 


1675 


nMHIA A A A ^ 

EST00833 


6 


AL2 - ACCCAGTTCTCAAAGACC 


O A 

22 


nflmn A A A 

EST00301 


6 


CT C CGTGATTACCTTCATCT 


a rt*r 

207 


ESTQQ167 


7 


GGTGCTAi^l~X"AW'GAATGCT 


137 


EST0U272 


7 


AGTGGTCACTATCTACATGG 


1659 


• n pmA a A ^ a 

EST00817 


7 . 


TGTATAGGCTCTACATAAAG 


1680 


Ct Af»l A A A ^ A 

EST00838 


7 


AL2 - GTTCTTTCCCAGGTATGC 


** a a 
292 


D0niA A A A 

EST00223 


8 


TG CAGCAGTGACCATGAGAA 


134 


n Am a a ^ i p 

EST00375 


9 


TCTGGGCTTCTGTGGTTCAA 


1906 


EST01021 


9 


GGATGTTTTCTATGTGACGA 


1645 


EST00804 


10 


CTCCTTTGGGACAAACAACT 


20 


B^fflA AAA A 

EST00024 


10 


AG CTGTTCCTGAGAGATG CA 


157 


EST00131 


10 


TCAGCAACAGGTCACTTTGG 


172 


EST00142 


10 


TACTAGCATTTCTTACTCTC 


250 


EST00197 


10 


GGTGATTAGAGAGTCTGTTG 


133 


EST00111 


11 


GGAAATTAGGCTTAGCTCAC 


178 


EST00294 


11 


GTTTGAAGGAAGTGATTTCC 


10 


EST00016 


11 


GTCTTTGGATTCTACGTAGA 


126 


EST00109 


11 


AL2 - CTAACCACAACCCACACATTG 


7 


EST00014 


12 


AACTTGCAACATAAATACTAG 


254 


EST00200 


13 


TTGTGTACTGTCTGATAGAC 


2409 


EST00273 


13 


GCAAGATGATGGAACATCCC 


170 


EST00295 


14 


GGTGCTTAAGGCCACTTTTG 


255 


EST00201 


14 


CCAGGAGAGTAAGAAGATCA 


230 


EST00221 


14 


GTGCCAAGATGGCTCATGTA 


293 


EST00224 • 


I 4 


AATGCATTATGCCTGGTCTT 


1664 


EST00822 


14 


GGGTCAGAATTAAGAGGTCT 


. 315 


EST00008 


14 


AAGCTGGCTGGGAAATGTTC 


1689 


EST00845 


14 


AL2 - AGGAGG AAGCTGAAATCC 


95 


EST00088 


15 


GTGACAGACCATGTCTATTG 


205 


EST00165 


15 


AGGATGACCTGAGTGAGCTG 


33 


EST00034 


16 


TGTGTGAAAGGGAGTCTTGT 


247 


EST00279 


16 


TGGCTAGGGCAGGCCTTAAA 


18 


EST00373 


16 


CCATCTGTGTCCCAATTAAGC 


. 68 


EST00068 


17 


CAAAGACGGGAGACGAATGA 


1652 


EST00811 


17 


GAGCTGCATGTTGATAAGTA 


1702. 


EST00854 


17 


AL2 - TTGCTGTGGAATCCATG AGAG 


84 


EST00080 


19 


AGAGATGTCAGTCCATTATC 


223 


EST00368 


19 


CATCATGTCGGAGACGCATT 


21 


EST00025 


20 


AGTTCTGGAGGCTAGGAGTT 


210 


EST00168 


20 


TGtCAACTTCCCTTTGGCCT 


136 


EST00113 


20 


AL2 -TCGGAGAAGTTGCAGTTTCTG 


120 


EST00103 


22 


CACTCACTGACTCCTCTTTA 


313 


EST00276 


X 


ATTGACCTTCAATGTAATAA . 



PRIMER #2 

GAAATCCCAGGGAGAGAATG 

GACCTCATTAGAAGAGCCCA 

TCAGTGTCGTACAATCTACC 

TGGAACCCTCAAACACTGCT 

ATCTAGCTGAAACATTGCTG 

T GAAA ATGAGTCACAGGCAG 

GGTTTACCATTCAGAGGC 

TTGTAGGTATCTCTGTCAGCT 

AG CAATGTG ATTTTGTAGG 

GATTCAGAATTACTAAGCCG 

CTTAATCATGGATTCTTCGT 

TTGTTGGTACTGAGGAAGTGCG 

ATCATCTTTCCACGCGGCTT 

CTGGCTGCTCAGCAACTCAT 

TTCCAGTGCCCCTTTTGTCC 

CCAACCCAAACATATTCTA 

CCTTGTGAAGAAAGACTTTC 

CTAAGCATCTGCATGTCCAG 

TATGCTGATTGTTTGCACTC 

GAACTCTGTAGTGTTCTAAA 

GTGCAGAATACTTAGAGTCC 

TAGGGCCACCTCCAGTTCAT 

CGATAATGACATTTCTTCTGG 

CCTCAGCACAAGAGAAGAATGG 

GAGCAATGATTTCTAACAGT 

TAAGCCATGGGCATCTATAA 

TTCCTTCTGGAGGCTCTACA 

CTTAGAGGATCATAGGTCTG 

GCAGAGTTGAATATGAACCT 

GTATAGCTTTAAGCCAGTTC 

GGAAAAGTCTAGAACTTAGT 

GTTCATCTCTAACTCCTTTC 

GTCATGCTAGTAAACTTACAC 

GGAAGTCCATAAGAGACTCACC 

AAGTGAGCGATTGCACCTTC 

CCA TGGC AGCAAGGAACTCT 

CCATTTTGACTGTTCCATAG 

GAGAAGAATATCAAATGGGG 

AGGGAAGAAGTCTAGAGCGA 

AGTGGAACGCGTGGCCTATG 

TTGACTTAAGCTGACCTTAA 

GGCAAGTGATCTGTTCTTGG 

CTATTCCACCTTACTCAAGG 

TGGATGACCTGAGTCTGCAG 

ATGTAAGGACCCCTAGATGG 

GAAGCTTGCTCATTCAGGAA 

GTTAAAAGCTGTTAGACGGGGC 

GGAACCGTAACTCTCCATAG 

TTGGATTGGGCAAAATAG 
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SSSLJD ESM Chr PRIMER ttl PRIMBR »2 



162 EST00133 X ATGTGAGCATCTATACCTGC AATGAAGGCATGAGAATAGG 

16€9 EST00827 X CGGACAACTAGGATAAATGC TACGCGTTTGAATGGCTTQA 

1917 EST01029 X OAATAGCATTATTAGCCAGT GOACCTATTGGAGATCTACT " 

1708 KST00858 X AL2 - AAGGCGAGGATTATGTGC TTCTACTGGGTACACTTCQACC ' 

Abbreviation: AL2 : Amino -Link -2 Pluorescent Tag, dir.: Chromosome. 
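
As a rough aid to reading the primer columns of Table 3, the following sketch computes primer length, GC content and an approximate melting temperature using the Wallace rule of thumb (2 °C for each A or T, 4 °C for each G or C). The specification does not state how the primers were designed or which annealing temperatures were used, so the rule and the helper names `strip_tag`, `gc_percent` and `wallace_tm` are assumptions introduced here for illustration.

```python
def strip_tag(primer: str) -> str:
    """Drop spaces and an optional 'AL2-' fluorescent-tag prefix."""
    seq = primer.replace(" ", "").upper()
    return seq[4:] if seq.startswith("AL2-") else seq


def gc_percent(primer: str) -> float:
    """Percentage of G and C among the bases of the primer."""
    seq = strip_tag(primer)
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def wallace_tm(primer: str) -> int:
    """Approximate Tm in degrees C: 2*(A+T) + 4*(G+C).

    Characters other than A, C, G, T (for example unreadable
    positions in the scanned table) are ignored.
    """
    seq = strip_tag(primer)
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Primer pair listed for SEQ ID 5 (EST00012) in Table 3.
for p in ("TCCAGGCAATCCCAGAATAG", "CTAATTGAGCTCACTGGCCC"):
    print(p, len(strip_tag(p)), gc_percent(p), wallace_tm(p))
```

The Wallace rule is only a coarse estimate for short oligonucleotides; it is used here because it needs nothing beyond the base counts.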
The foregoing techniques have been used to further
localize 9 ESTs and their associated genes to precise
locations on chromosome 6 or chromosome X, as reflected
in Table 4A (in Example 7 below), using sublocalization
techniques that employ somatic cell hybrids. ESTs were
used as hybridization probes and mapped to other
chromosomes using techniques disclosed in Example 7.
Somatic cell hybrids were prepared that contained defined
subsets of chromosomes 6 and X. Methods for preparing and
selecting somatic cell hybrids are known in the art. For a
review of an exemplary procedure to generate somatic cell
hybrids containing the short arm of human chromosome 6, see
Zoghbi, et al., Genomics 9(4):713-720 (1991). For a
general review of somatic cell hybridization see Ledbetter
et al. (supra). The hybrids were processed to obtain DNA
and analyzed by PCR and by fluorescence in situ
hybridization. SEQ ID NOs 19, 22, 1, 224, 288 mapped to
chromosome 6, while SEQ ID NOs 162, 1917, 1669 and 1899
mapped to chromosome X using somatic cell hybrids.

EXAMPLE 6

Mapping of All ESTs to Human Chromosomes

The procedure of Example 5 is repeated for all of the
ESTs from Examples 1 and 2 not previously mapped to human
chromosomes. Data are generated corresponding to the data
in Table 3 for all of the unmapped ESTs. As previously
mentioned, virtually all of the ESTs will map to a unique
chromosomal location. The inability of any ESTs to
localize to a unique location will be readily ascertainable
during the mapping process.

Physical mapping of the type reported in Table 4 on
all the EST clones reported here would provide human
chromosome markers spaced on average every 1.2 megabases
and would roughly double the number of expressed sequences
that have been localized to chromosomes (McKusick, V. FASEB
J. 5: 12-20 (1991)). Mapped ESTs are also a new resource
to identify candidates for the estimated 5000 single-locus
disease-associated genes (Id.).
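
The marker-spacing figure is simple arithmetic, sketched below. The genome size used here (about 3,000 megabases for the haploid human genome) is an assumption added for illustration; the text gives only the resulting average spacing of 1.2 megabases. The function names are likewise hypothetical.

```python
# Assumed haploid human genome size in megabases (not stated in the text).
GENOME_SIZE_MB = 3000.0


def markers_for_spacing(spacing_mb: float, genome_mb: float = GENOME_SIZE_MB) -> float:
    """Number of evenly spread markers implied by an average spacing."""
    return genome_mb / spacing_mb


def average_spacing_mb(n_markers: int, genome_mb: float = GENOME_SIZE_MB) -> float:
    """Average spacing in megabases for a given number of mapped markers."""
    return genome_mb / n_markers


# An average spacing of 1.2 Mb corresponds to roughly this many markers.
print(round(markers_for_spacing(1.2)))
```

Under that assumption, a spacing of 1.2 megabases corresponds to a marker count in the low thousands, the same order of magnitude as the number of SEQ ID NOs listed in the tables of this specification.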

EXAMPLE 7

Alternative Technique for Mapping to Chromosomes:
Mapping of ESTs to Chromosomes Using Fluorescence In Situ
Hybridization

This technique was used to map an EST to a particular
location on a given chromosome. Cell cultures, tissue, or
whole blood were used to obtain chromosomes.

0.5 ml of whole blood was added to RPMI 1640 and
incubated 96 hours in a 5% CO2/37°C incubator. 0.05 ug/ml
colcemide was added to the culture one hour before harvest.
Cells were collected and washed in PBS. The suspension was
incubated with a hypotonic solution of KCl added dropwise to
reach a final volume of 5 ml. The cells were spun down and
fixed by resuspending the cells in methanol and glacial
acetic acid (3:1). The cell suspension was dropped onto
glass slides and dried.

The slides were treated with RNase A, washed, and then
dehydrated in a series of increasing concentrations of
ethanol.

The EST to be localized was nick-translated using
fluorescently labeled nucleotide (Korenberg, J.R., et al.,
Cell 53(3):391-400 (1988)). Following nick translation,
unincorporated label was removed by spin dialysis through
Sepharose. The probe was further extracted with
phenol-chloroform to remove additional protein. The chromosomes
were denatured in formamide using techniques known in the art
and the denatured probe was added to the slides. Following
hybridization, the cells were washed. The slides were
studied under a fluorescent microscope. In addition, the
chromosomes can be stained for G-banding or Q-banding using
techniques known in the art.

The resulting metaphase chromosomes had fluorescent tags
localized to those regions of the chromosome that were
homologous to the EST. Thus, a particular EST was localized
to a particular region on a given chromosome. In this
manner, SEQ ID NOs 396, 485, 506, 1880 and 1894 were mapped
using fluorescent in situ hybridization to locations on
chromosomes 17, 7, 10 and 1 respectively (See Table 4B
below). For a review of the technique see Verma et al.,
Human Chromosomes: A Manual of Basic Techniques, Pergamon
Press, NY (1988), which is hereby incorporated by reference.

Table 4: Precise Chromosomal Localization of ESTs

      SEQ ID  EST#      Map Location
A.    19      EST00023  6p
      22      EST00301
      1894    EST01643  6p21
      1       EST00007  6q
      224     EST00356  6q
      288     EST00219  6q
      162     EST00133  Xp11.21 - Xp21.2
      1917    EST01029  Xp11.21 - Xp21.2
      1669    EST00827  Xq26 - Xq27.1
      1899    EST01014  Xq28
B.    1880    EST01634  1q32
      465     EST01466  7p13
      506     EST01471  10q11.2
      396     EST01443  17q25



EXAMPLE 8
Automated DNA Sequencing Accuracy

ESTs that match human sequences in GenBank are
excellent tools for the analysis of the accuracy of
double-strand automated DNA sequencing. Ninety EST/GenBank
matches were examined for the number of nucleotide
mismatches and gaps required to achieve optimal alignment
by the Genetics Computer Group (GCG) program BESTFIT
(Devereux et al., Nucleic Acids Research 12: 387 (1984)).
The number of mismatches, insertions and deletions was
counted for each hundred bases of the sequence (Table 5).
As expected, the sequence quality was best closest to the
primer and decreased rapidly after about 400 bases. The
number of deletions and insertions relative to the GenBank
reference sequence increased five- to ten-fold beyond 400
bases, while the number of mismatches doubled. The average
accuracy rate for individual double-stranded sequencing
runs was 97.7% to 400 bases.

TABLE 5. Accuracy of Single-Run Double-Stranded Automated Sequencing

Bases from   Mismatches/   Gaps                     Percent    Aligned
Primer       Ambiguities   Insertions   Deletions   Accurate   Bases
101 - 200    1.45          0.18         0.19        98.2       8,800
201 - 300    1.72          0.25         0.11        97.9       8,130
301 - 400    2.07          0.98         0.37        96.6       5,404
>400         3.53          2.63         1.06        92.8       3,197

ESTs statistically identical to known human sequences and those matching
mitochondrial and ribosomal genes were aligned with sequences from GenBank using
the GCG program BESTFIT. The first 85 nucleotides were polylinker sequence, which
was not aligned with the pBluescript SK reference sequence; tabulation of errors
began 15 bases into the BESTFIT alignment and thus is reported beginning with
bases 101-200. Error rates are reported as number of mismatches, insertions, or
deletions per hundred aligned bases. "Mismatches" includes ambiguous base calls.
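
The figures in Table 5 can be cross-checked with a short calculation. The sketch below assumes, since the table does not say how the column was derived, that "Percent Accurate" is 100 minus the summed mismatch, insertion and deletion rates, and that the 97.7% average quoted in the text is weighted by the number of aligned bases in the first three windows. The variable and function names are illustrative only.

```python
# (window, mismatches, insertions, deletions, aligned_bases)
# Error rates are per hundred aligned bases, as in Table 5.
TABLE_5 = [
    ("101-200", 1.45, 0.18, 0.19, 8800),
    ("201-300", 1.72, 0.25, 0.11, 8130),
    ("301-400", 2.07, 0.98, 0.37, 5404),
    (">400",    3.53, 2.63, 1.06, 3197),
]


def percent_accurate(mismatches, insertions, deletions):
    """Assumed definition: 100 minus the summed error rates."""
    return 100.0 - (mismatches + insertions + deletions)


def weighted_accuracy(rows):
    """Accuracy averaged over windows, weighted by aligned bases."""
    total_bases = sum(r[4] for r in rows)
    weighted = sum(percent_accurate(*r[1:4]) * r[4] for r in rows)
    return weighted / total_bases


for window, mm, ins, dele, bases in TABLE_5:
    print(window, round(percent_accurate(mm, ins, dele), 1))

# Windows up to 400 bases from the primer.
print(round(weighted_accuracy(TABLE_5[:3]), 1))
```

Under these assumptions the recomputed per-window values and the weighted average to 400 bases agree with the tabulated column and with the 97.7% figure in the text.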
EXAMPLE 9

Probability of ESTs Containing Coding Sequences

The ESTs of the present invention were statistically
evaluated using the coding-region prediction program CRM
via the GRAIL server (Uberbacher, E. & Mural, R. Proc.
Natl. Acad. Sci. USA, 88: 11261-5 (1991)). The CRM program
uses a neural network to combine results from several
different measures of coding regions, looking at different
6 bp sequences found in coding exons and in introns. The
program additionally conducts reading frame searches and
assesses randomness at the third position of codons. This
protocol categorizes sequences as having an excellent,
good, marginal, or poor probability of containing coding
regions. The results are reported in Tables 6-9. There
were 219 ESTs categorized as "excellent" (Table 6); 120
categorized as "good" (Table 7); 113 categorized as
"marginal" (Table 8); and 1743 categorized as "poor" (Table
9). These results indicate that most ESTs of the present
invention comprise noncoding regions.
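
The category counts reported above can be turned into proportions with a few lines of Python. The counts are taken from the text; the dictionary and function names are illustrative and not part of the specification.

```python
# Counts of ESTs per coding-probability category (Tables 6-9).
CATEGORY_COUNTS = {
    "excellent": 219,
    "good": 120,
    "marginal": 113,
    "poor": 1743,
}


def category_percentages(counts):
    """Percentage of all evaluated ESTs falling in each category."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


print(sum(CATEGORY_COUNTS.values()))
print(category_percentages(CATEGORY_COUNTS))
```

The "poor" category accounts for roughly four fifths of the evaluated ESTs, which is the basis for the statement that most of the ESTs comprise noncoding regions.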
Table 6: ESTs with Excellent Probability of Containing Coding Sequence

SEQ ID#   EST#


973 


PST019A7 

Ca 1 VI TOT 


4QA7 

1oW 


CCTAAO/ 4 

EST00941 




EST00014 


979 


EST019OT 


4 ono 

louy 


EST0D943 


7 


980 


PST010QA 

ca IV 1 TT*» 


1B20 


CCTAAAC 4 

EST0095 1 


15 


EST00020 


986 


EST02000 
ca 1 U6UUV 


moo 
locy 


E5T0v93o 


48 


EST00291 


1000 


CCT09014 
ca 1 ucu 1*1 




CST00973 


62 


EST00064 


1004 


ESTOPfllft 
ca 1 ucu 10 


IftAA 


CST0U9B3 


66 


EST00067 


1007 


ca 1 ucuc l 


4 OCX 
lOOO 


ecTAAnon 
EST009oy 


75 


EST00074 


1018 


ca 1 ucu3C 


4(171 
lOf 1 


CCTAAAA/ 

£5100994 


98 


EST00260 


1021 


Ca 1 UCU33 


lOOO 


EST 01 005 


106 


EST00092 


IOTA 


F<JT090Sn 
cai vcujv 


4 Don 
IotO 


CCTAIAA7 

EST01007 


108 


ESI 00094 


1047 


ca 1 wcuoo 


love 


CCf A4AAA 

EST01009 


114 


EST 00098 


1000 


caiuciuy 


1903 


EST01018 


115 


P ST 00099 

C« t MWU77 


1AOA 
IUYO 


CCTfl91 is 

catuci 13 


4 on/ 
1904 


EST01019 


124 

ICH 


PST00107 


111S 
1 1 13 


cai uc 133 


404/ 
1914 


EST01026 


ICO 




111A 
1 1 IO 


cai ucioo 


4A7A 

1930 


EST01040 


ISA 


EST 00 130 


1190 

1 icy 


caiuciuy 


4 A/ / 

1944 


EST01050 


1AA 




1 133 


Ca lire 133 


1949 


EST01054 


1AA 


C9I vw 13f 


4 4/ 4 
11«M 


C6TM1/1 

EST 021 63 


1962 


EST01062 


17A 


PCTAfl90A 


4 4 CI 

1163 


CCTM4 09 

EST 02167 


1973 


EST01071 


170 


ca i uu 143 


1 4OT 

1 IOO 


EST0220B 


1977 


EST01075 


1 03 


ca i uw i**o 


19AT 
1243 


eeTn447) 
CST02Z72 


1982 


EST01080 


901 
cU 1 


C3 1 ww IOO 


1264 . 


cST022y3 


1991 


EST01088 


CU3 


pctaaias 
caiuu 103 


1Z65 


C0Tn4Ml 

EST0Z2V4 


1993 


EST01090 


91S 


caiuui r c 


1266 


ESTD22y3 


2000 


EST01097 


9ta 

C3U 


pctaaiai 
ca i uu io i 


4 OB? 
ICO/ 


EST02317 


2001 


EST01098 


9ST 


ccTfimoo 
cat uuiyy 


ttno 
10U0 


ESTU2338 


2012 


EST01106 


9AT 


ca I wUcuo 


4 TO/ 

1024 


EST02354 


2013 


EST01107 


9AR 
COO 


PCTflATAO 

ca i uuooy 


41//. 

1344 


EST 023 74 


2024 


EST01117 


97n 


PCTAA9A7 


1350 


cST023oo 


2043 


EST01131 


971 


PCTAA9AT 

caiuvcoo 


1305 


E n TA^7Az 

ESTOcoyo 


'2051 


EST01138 


97T 


ca 1 UUcUO. 


4 TOT 
lOOO 


EST0Z415 


2056 


EST01142 


97A 
£/ O 


calUUcl 1 


1399 


e*TAvn 

EST0Z433 


2058 


EST01144 


9A1 


PCTAA91A 
c a 1 uuc 1 h 


1*|U1 


eeTrtO/ tc 
EST 02435 


2059 


EST01145 


CO? 


caTUUcOQ 


4/nC 
1403 


g »TAflf 9#4i 

EST02439 


2064 


EST01149 


333 


CS1UU3T4 


1417 


EST0Z452 


2090 


EST01167 


OOO 


ca 1 uuoyr 


4/C4 

1451 


EST0Z487 


2094 


EST01171 


«o 
337 


CCTAAAAA 

ca 1 uuhuu 


4/C7 

143f 


EST02493 


2116 


EST01192 


TA9 
30C 


caiuwHio 


4/Z.72 

1463 


EST 025 00 


2117 


EST01193 


ooy 


calUUfwU 


1473 


ESTOZ510 


2128 


EST01202 


¥♦1 


CCTflOAAl 
calUUHOl 


4 /.TO 

1479 


ESTOZ316 


2131 


EST01205 


ASA 
*»3*» 


pqtaaaot 
■ ca I UUHyO 


4C4£ 
13 ID 


CCTMCCC 

EST02333 


2134 


EST01208 


ATA 


ca 1 Uu3uy 


4 COO 

13co 


EST 02369 


2144 


EST01216 




calUu3cc 


«C74 

1331 


EST0Z57Z 


2145 


EST01217 




caiUU3cy 


■ CI / 

1544 


EST02586 


2150 


EST01222 


- CIA 


caIUw33Q 


1331 


EST 025 93 


2155 


EST01227 


Sift 

310 




4CCO 
1330 


ESTOZoOl 


2161 


EST01231 


SSI 


caTUla>Oc 


1561 


EST0Z604 


2* -3 


EST01238 


SS9 
33C 


EST 00505 


1581 


EST02625 


2174 


EST01242 


SSO 
33t 


calUAOfU 


1586 


EST 02631 


2176 


EST01244 


582 


EST00592 


1591 


EST02636 


2189 


EST01255 


cno 
Owe 


cSTOOoOo 


1616 


EST02661 


2214 


EST01272 


DUO 


cat UvOUV 


1624 


EST02670 


2225 


EST01278 


608 


ca 1 UU01 1 


1O30 


EST0Z676 


2227 


EST01279 


C94 

OC1 


EST 00620 


1637 


EST00796 


2233 


EST01284 


ATS 


caiuuocy 


1639 


EST00799 


2235 


EST01286 


A/. 9 
OHC 


cbiUUOOH 


1649 


EST00808 


2236 


EST01287 


AAA 


caiUuooo 


1651 


EST00B10 


2255 


EST01302 


AA7 


calUUOr 1 


1677 


ESTO0B35 


2259 


EST01304 


700 


Ca 1 WWOQ3 


4 AA9 
lOOc 


cSTOUoor 


2263 


PA«M «1 4tA*« 

EST01307 


743 


EST00714 


1694 


EST00849 


SEQ ID# 


EST# 


7ST 


CQTAA791 

caiuur c 1 


4 7ni. 

1 fUo 


CffTAA0C7 

EST00857 




7An 


CCTAA79C 

caiuurco 


1708 


EST 00858 


2267 


EST01756 


764 


ES700729 


1710 


EST00860 


99R1 
ceo 1 


CCTA1T91 
CalUlOCl 


808 
823 


EST00761 


1716 


EST00865 


2283 


EST01322 


EST01864 


SEQ ID# 


EST# 


2300 


EST01333 


834 


EST00771 






2303 


EST01335 


886 


EST01886 


1718 


EST00867 


2303 


EST0133S 


*19 


EST01921 


1731 


EST00879 


2314 


EST01345 


930 


EST01933 


1742 


EST00887 


2334 


EST01358 


SEQ ID# 


EST# 


1746 


EST0089T 


2339 


EST01362 


936 


EST01939 


1760 


EST00903 


2342 


EST01365 


1767 


EST00907 


2348 


EST01371 


948 


EST01957 


1769 


ESTO0909 


2358 


EST01379 


965 


EST01978 


1777 


EST00913 


2367 


EST01388 



2373 EST01393 

2374 EST01394 

2393 EST01417 

2394 EST01418 
2396 EST01420

Table 7: ESTs with Good Probability of Containing Coding Sequence

SEQ ID#   EST#
20        EST00024
72        EST00071
82        EST00078
88        EST00084
137       EST00272
177       EST00328
193       EST00156
200       EST00162
218       EST00175
228       EST00179
247       EST00279
264       EST00204
2*7       EST00297
296       EST00228
371       EST00426
385       EST00436
392       EST00442
414       EST00460
433       EST00474
453       EST00492
471       EST00505
496       EST00525
524       EST00544
526       EST00546
529       EST00549
549       EST00563
557       EST00569
578       EST00588
596       EST00602
607       EST00610
619       EST00619
657       EST00646
660       EST00649
689       EST00673
695       EST00679
699       EST00682
729       EST00703
742       EST00713
747       EST00717
755       EST00723
759       EST00725
776       EST00738
778       EST00740
782       EST01551
829       EST00768
835       EST00772
836       EST00773
862       EST01872
881       EST01881
884       EST01884
924       EST01926
929       EST01932
938       EST01941
971       EST01985
995       EST02009
996       EST02010
1031      EST02046



XV* X 


AOXU«U3 / 






i noo 

1U77 


aoTDZllo 


11U9 




ill! 




111Q 
XXj J 


JSoxuziox 


1146 


ESTQ2100 


1196      EST02221
1210      EST02238
1233      EST02262
1285      EST02314
1331      EST02361
1388      EST02421
1418      EST02453
1439      EST02475
1502      EST02540
1537      EST02578
1563      EST02606
1599      EST02644
1602      EST02647
1693      EST00848
1695      EST00850
1729      EST00877
1730      EST00878
1738      EST00883
1739      EST00885
1743      EST00888
1768      EST00908
1780      EST00916
1804      EST00938
1805      EST00939
1811      EST00945
1819      EST00950
1826      EST00956
1830      EST00959
1845      EST00971
1848      EST00974
1853      EST00977
1967      EST01066
1992      EST01089
1994      EST01091
1997      EST01094
2046      EST01134
2101      EST01177
2102      EST01178
2105      EST01181
2106      EST01182
2141      EST01213
2184      EST01251
2196      EST01260
2203      EST01264
2232      EST01283
2308      EST01339
2345      EST01368
2346      EST01369
2351      EST01373
2354      EST01375
2355      EST01376
2359      EST01380
2362      EST01383
2378      EST01397
2399      EST01423

2407      EST02714

Table 8: ESTs with Marginal Probability of Containing Coding Sequence

SEQ ID#   EST#
11        EST00018
12        EST00274
24        EST00027
45        EST00364
79        EST00076
90        EST00302
110       EST00096
144       EST00120
145       EST00121
192       EST00155
222       EST00177
234       EST00184
277       EST00212
319       EST00381
368       EST00423
370       EST00425
387       EST00438
402       EST00451
415       EST00461
418       EST00464
426       EST00470
503       EST00528
517       EST00539
522       EST00543
532       EST00551
540       EST00557
570       EST00580
573       EST00583
576       EST00586
613       EST00615
617       EST00617
626       EST00622
681       EST00665
726       EST00700
727       EST00701
738       EST00711
745       EST00715
752       EST00720
791       EST00746
795       EST00749
803       EST00756
845       EST00777
852       EST00782
854       EST00784
907       EST01907
912       EST01912
935       EST01938
968       EST01981
985       EST01999
988       EST02002
1043      EST02059
1081      EST02100
1089      EST02108
1116      EST02136
1134      EST02154
1205      EST02233
1222      EST02251
1224      EST02253
1228      EST02257
1267      EST02296
1301      EST02331
1397      EST02431
1448      EST02484
1480      EST02517
1493      EST02531
1499      EST02537
1503      EST02541
1527      EST02568
1536      EST02577
1548      EST02590
1562      EST02605
1572      EST02615
1575      EST02618
1595      EST02640
1608      EST02653
1610      EST02655
1621      EST02667
1627      EST02674
1629      EST02677
1631      EST02678
1683      EST00840
1692      EST00847
1751      EST00895
1756      EST00900
1764      EST02690
1770      EST00910
1793      EST00929
1847      EST00973
1877      EST00998
1897      EST01012
1900      EST01015
1939      EST01655
1940      EST01046
1954      EST01058
1990      EST01087
2008      EST01103
2031      EST01123
2041      EST01130
2044      EST01132
2060      EST01146
2100      EST01176
2136      EST01210
2153      EST01225
2204      EST01265
2212      EST01270
2248      EST01297
2250      EST01299
2266      EST01310
2309      EST01340
2347      EST01370
2388      EST01406
2398      EST01422
2405      EST01427
Table 9: ESTs with Poor Coding Probability

SEQ ID#   EST#


103 


EST0Q317 


204 


EST0O235 


309 


EST00174 


404 


EST00453 

CO I wwvww 




104 


EST00354 


206 


EST00166 


315 


EST 00008 


405 


EST00454 


1 


EST00007 


105 


EST0Q365 


207 


EST00167 


316 


EST00378 


406 


EST00455 


2 


EST00009 


107 


EST 00093 


209 


EST00331 


317 


EST0O379 


407 


EST00456 


3 


EST00010 


109 


EST00095 


210 


EST00168 


318 


EST00S8O 

m W ff vwllV 


408 


EST 00457 


4 


EST00011 


111 


EST00281 


211 


EST00332 


320 


EST00382 


409 


EST01444 


5 


EST00012 


112 


EST0G318 


212 


EST00169 


321 


EST00383 

COI WWW WW 


410 


FST00658 
ca i uutwo 


6 


EST00013 


113 


EST00097 


213 


EST00170 


322 


EST0Q384 


411 


FST0Q459 


8 


EST00234 


116 


EST00100 


214 


EST00171 


323 


EST00385 

CO 1 uwwuw 


412 

** IC 


FST01AA5 
Colli 1443 


10 


EST00016 


117 


EST0Q319 


216 


EST00173 


325 

wcw 


ESTQ0386 

CO 1 UUJUU 


416 

*t IO 


Co 1 UWD4 


14 


EST00019 


118 


EST00101 


219 


EST00176 

by ■ WW UU 


-326 


FST003fl7 


617 


CO 1 UUlOw 


16 


EST00021 


119 


EST00102 


220 


EST00372 

SO 1 VMII C 


327 


EfiT0T1388 


A10 


eernnAAfi 

COluv*03 


17 


EST00022 


120 


ESTO01O3 


221 


EST00359 


328 


EST00389 

COIUwwOT 


620 


eernflAAA 
CollHWOO 


18 


EST00373 


121 


EST 001 04 


224 


EST00356 


329 


EST0O39O 

COI www TU 


A21 


ceTAftAA7 


19 


EST00023 


122 


EST00105 

co i w i ww 


225 


ESTQQ178 

CO 1 UW 1 f O 


330 
www 


coIvuOtI 


4cc 


eeTA4AA7 


21 


EST00025 


123 


EST00106 


226 


EST00333 

COI VUMJ 


-331 

* WW 1 


FSTn0392 

CO 1 UUJ7C 


**cw 


CSIlHMOO 


23 


EST 00026 


12$ 


EST00108 


229 


EST 00 180 
ca i ww iou 


332 

wwC 


Co 1 vUwTw 




EST0144O 


25 


EST00028 


126 


EST 001 09 


231 


EST00334 


334 

WW*! 


ColUUwT7 




FCTAAAAO 
coIUUwr 


27 


EST00029 


127 


ESTO032O 


232 


EST001B2 

CO 1 WW IOC 


335 

www 


ESTQ0306 


627 


pern 1 AAO 


28 


EST00030 


129 


EST0Q321 


233 


EST00183 

CO 1 WW IQW 


337 

WW f 


Co 1 wUwTO 


4cO 


CoIUl43 1 


29 


EST00031 


130 


EST00355 

W w 1 WWW WW 


235 


FST00185 

Col VW IQJ 


340 


Cal UUhUc 


*CT 


ECTOflA71 

calUwrl 


30 


EST0G032 


131 


EST00322 


936 


FST001R6 

Col UU IOD 


361 
w4 1 




Hwl 


cSTuU4r3 


31 


EST00033 

CO 1 UiMM 


133 


EST001 1 1 

C9IW III 


237 


F&TAA1A7 
Col Uu lor 




cerrVvifiA 


/TO 


ESTD143Z 


32 


EST 00233 

CO I WWMJ 


134 


EST 00375 

CO I WWWf J 


238 


Col uu too 


w4** 


Co 1 W3*kUj 


ATA 


cal0vAr3 


33 


ESTQ0034 

CO ■ WVthn 


135 


EST00112 

CO ■ uu tic 


239 


Col UU IOT 


w43 


Col UU*» UD 


i7C 


CCTAA/?/ 

cSTUuArO 


34 


ESTC0035 


136 


EST001 13 

CO 1 W 1 1/ 


240 


FST00335 

COI Www w 


3A7 


ColwlOcT 


1JO 


cernn/77 


35 


EST 00036 


138 


EST001 14 

W *J I W I 1^ 


241 


EST00191 
ca i uu |7 i 


348 

w40 


Col O IOwU 


ATT 
•Of 




36 


EST00Q37 


139 


EST001 16 


242 


EST0019? 

Cot WW.ITC 


349 

JIT 


FST01831 
C9I vlOw 1 


A3A 
HwO 


CalUU4rT 


39 


EST00Q39 


140 


EST00117 


243 

c"tw 


EST00193 
cai uu itw 


350 

w?U 


Col WWf 


AXO 


cerftftAftn 
cSluvwO 


40 


EST00040 


141 


E5T001 18 

CO 1 W I IW 


244 

C*f*t 


EST 00 196 

Col UU IT*» 


351 
i 


ColUIWUO 


aao 


Colli 1434 


41 


EST00041 


142 


F ST 00393 




FST003A7 
ca i uujn f 


352 

w3£ 


ColUwAUT 


AA9 
Vtc 


CSIU1430 


42 


EST 00042 


143 


EST00119 

CO 1 W 117 


246 


EST 00196 

COI UU ITO 


353 

www 


Col UU% IU 


663 


FQTnAAA9 
CoiUU4Qc 


46 


EST00044 


146 


EST00122 


250 


EST 97 

Col UUITf 


356 


P&T01633 


AAA 


cerAAAAT 
CoIUwmOw 


47 


EST00046' 


147 


EST00292 


252 


EST00198 
ca i uu ito 


355 


EST 00611 

C3IUU1I 1 


666 


FCtTAAAftC 
Colwv4o3 


49 


EST00047 


148 


EST00236 


254 


col uucuu 


356 

www 


PST00612 


AA7 


EeTAAAAA 
C01UU400 


SO 


EST00048 


149 


EST00123 

CO 1 WW 1 CW 


255 


ca i uucu i 


357 


FST00613 

C9IW1 Iw 


AAA 


FCTAAAA7 

Caiwu40r 


51 


ESTO0Q49 


150 


EST00124 

CO 1 WW 1 C"» 


256 


ESTO0345 


358 

wwO 


Catww^ I 1 * 


AAO 


CCTAAAAA 
COIUU400 


52 


EST00052 


151 


E5T00125 


257 
i 


FST00337 

ColUUwwf 


350 


C9lUw4l3 


icn 
■OU 


COT AAA AO 


53 


EST00054 


152 


EST00126 

C9 1 WW 1 CO 


250 


EQXAA9A9 
Col uucuc 


36A 
wOw 


CCTftAAIA 
caiUuiiO 


431 


EST00490 


54 


EST00055 


153 


EST00127 

1 WW 1*1 


260 


EST00357 

CO 1 UUWf 


361 

wO I 


PST00617 
Col uui i r 


AC9 


CCTAAXOi 
COIUW4T1 


55 


EST00056 


154 


EST 001 2 A 

by I ww 1 cw 


261 


FST0033A 
ca i wwwwu 


363 

www 


Col W9*9 IT 


A^C 


eeTAAAOA 
ColwU4T4 


56 


EST00057 


155 


EST00129 

IW w 1 WW wmmW 


262 


EST00339 
ca i wujjt 


366 

wQ*l 


FST00620 

Co 1 UU1CU 


657 


parnAAOC 

Co 1 UU4TJ 


57 


ESTQQ058 


157 


EST 001 31 
ca i uu i w I 


265 


C9IUUCU3 


w03 


FCTA1A3A 


458 


CCTAAACkA 
CaTuwVD 


58 


EST00059 


I/O 


FSTM132 


266 
coo 


col wwcuo 


tax 
wOO 


ESIUWCI 


/CO 

H3t 


ceTAAA<17 


59 


ESTOQ0&1 

ESI UWVO 1 


159 


EST 00325 

Ca 1 UUJ63 


272 
c» c 


EST00360 

Co 1 UWtU 


wOr 


FCTAflA99 
callHWcc 


AAA 


CSTUl43f 


60 


per 00062 


160 

IOU 


F ST 00326 

C9IUWCO 


276 


FST0076A 


3AO 

woy 


CalUWc<t 


AA1 
HOI 


CCTA1 ATA 


63 


EST 00065 


162 


EST 001 33 

btf 1 WW 1 WW 


275 


EST 00209 
CD 1 UcUC 3


11 AO 
live 


eornof oi 
EST Uc 111 


4 4A4. 
1194 


EST 022 19 


1286 


EST02316 


1377 


EST02408 


lulu 


BCTA9A9A 

cSTUcUch 


11 A/. 

1104 


cSTOZIZS 


4 4 AC 

1195 


EST 02220 


1288 


EST02318 


1378 


EST02409 


1011 EST02025   1106 EST02125   1197 EST02222   1289 EST02319   1379 EST02410
1012 EST02026   1107 EST02126   1198 EST02223   1290 EST02320   1380 EST02411
1013 EST02027   1108 EST02127   1199 EST02224   1291 EST02321   1381 EST02413
1014 EST02028   1109 EST02128   1200 EST02226   1292 EST02322   1382 EST02414
1015 EST02029   1110 EST02129   1201 EST02228   1293 EST02323
1016 EST02030   1111 EST02131   1202 EST02229   1294 EST02324
1017 EST02031   1112 EST02132   1203 EST02230   1295 EST02325
1019 EST02033   1114 EST02134   1204 EST02232   1296 EST02326
1022 EST02036   1117 EST02137   1206 EST02234   SEQ ID#  EST#
1023 EST02037   1119 EST02139   1207 EST02235
1024 EST02038   1120 EST02140   1208 EST02236   1298 EST02328
1025 EST02040   1121 EST02141   1209 EST02237   1299 EST02329
1026 EST02041   1122 EST02142   SEQ ID#  EST#   1300 EST02330
1027 EST02042   1123 EST02143                   1302 EST02332
1028 EST02043   1124 EST02144   1211 EST02239   1303 EST02333
1029 EST02044   1125 EST02145   1212 EST02240   1304 EST02334
1030 EST02045   SEQ ID#  EST#   1213 EST02241   1305 EST02335
1032 EST02048                   1214 EST02242   1306 EST02336
1033 EST02049   1127 EST02147   1215 EST02244   1307 EST02337
1036 EST02052   1128 EST02148   1216 EST02245   1309 EST02339
SEQ ID#  EST#   1130 EST02150   1217 EST02246   1310 EST02340
                1131 EST02151   1218 EST02247   1311 EST02341
1037 EST02053   1132 EST02152   1219 EST02248   1313 EST02343
1038 EST02054   1135 EST02155   1220 EST02249   1314 EST02344
1040 EST02056   1136 EST02156   1221 EST02250   1315 EST02345
1042 EST02058   1137 EST02157   1223 EST02252   1316 EST02346
1044 EST02060   1138 EST02159   1225 EST02254   1317 EST02347
1045 EST02061   1140 EST02162   1226 EST02255   1318 EST02348
1046 EST02062   1142 EST02164   1227 EST02256   1319 EST02349
1048 EST02064   1143 EST02165   1232 EST02261   1320 EST02350
1049 EST02065   1144 EST02166   1234 EST02263   1321 EST02351
1050 EST02066   1145 EST02167   1235 EST02264   1322 EST02352
1051 EST02067   1148 EST02170   1236 EST02265   1323 EST02353
1052 EST02068   1149 EST02171   1237 EST02266   1325 EST02355
1053 EST02069   1150 EST02172   1238 EST02267   1326 EST02356
1054 EST02070   1152 EST02174   1239 EST02268   1327 EST02357
1055 EST02071   1153 EST02175   1240 EST02269   1328 EST02358
1056 EST02072   1154 EST02176   1241 EST02270   1329 EST02359
1057 EST02073   1155 EST02177   1242 EST02271   1330 EST02360
1058 EST02074   1156 EST02178   1244 EST02273   1333 EST02363
1059 EST02075   1157 EST02180   1246 EST02275   1334 EST02364
1060 EST02076   1158 EST02181   1247 EST02276   1335 EST02365
1061 EST02078   1159 EST02182   1248 EST02277   1336 EST02366
1062 EST02079   1160 EST02183   1249 EST02278   1337 EST02367
1063 EST02081   1161 EST02184   1250 EST02279   1338 EST02368
1064 EST02082   1162 EST02185   1251 EST02280   1339 EST02369
1065 EST02083   1164 EST02188   1252 EST02281   1342 EST02372
1066 EST02084   1165 EST02189   1253 EST02282   1343 EST02373
1067 EST02085   1166 EST02190   1254 EST02283   1345 EST02375
1068 EST02086   1167 EST02191   1255 EST02284   1346 EST02376
1070 EST02088   1168 EST02193   1256 EST02285   1347 EST02377
1071 EST02089   1169 EST02194   1257 EST02286   1349 EST02379
1072 EST02090   1170 EST02195   1258 EST02287   1350 EST02380
1073 EST02091   1171 EST02196   1259 EST02288   1351 EST02381
1074 EST02092   1172 EST02197   1260 EST02289   1352 EST02382
1075 EST02093   1173 EST02198   1261 EST02290   1353 EST02383
1076 EST02094   1174 EST02199   1262 EST02291   1354 EST02384
1077 EST02096   1175 EST02200   1263 EST02292   1355 EST02385
1078 EST02097   1176 EST02201   1268 EST02297   1357 EST02387
1079 EST02098   1177 EST02202   1269 EST02298   1358 EST02388
1080 EST02099   1178 EST02203   1270 EST02299   1359 EST02390
1082 EST02101   1179 EST02204   1271 EST02300   1360 EST02391
1084 EST02103   1180 EST02205   1272 EST02301   1361 EST02392
1085 EST02104   1182 EST02207   1273 EST02302   1362 EST02393







WO 93/16178

SEQ ID#  EST#   1485 EST02522   1592
                1486 EST02523   1593
1384 EST02416   1487 EST02524   1594
1387 EST02419   1488 EST02525   1596
1389 EST02422   1489 EST02526   1597
1390 EST02423   1490 EST02527   1598
1391 EST02424   1491 EST02529   1600
1392 EST02425   1494 EST02532   1601
1393 EST02426   1497 EST02535   1603
1394 EST02427   1498 EST02536   1604
1396 EST02430   1501 EST02539   1605
1398 EST02432   1504 EST02542   1606
1400 EST02434   1506 EST02545   1607
1402 EST02436   1507 EST02546   1609
1403 EST02437   1508 EST02547   1611
1404 EST02438   1509 EST02548   1612
1406 EST02440   1510 EST02549   1613
1407 EST02441   1512 EST02551   1614
1410 EST02444   1513 EST02552   1615
1411 EST02445   1514 EST02553   1617
1414 EST02448   1515 EST02554   1618
1415 EST02449   1517 EST02558   1619
1416 EST02450   1518 EST02559   1620
1419 EST02454   1519 EST02560   1622
1420 EST02456   1520 EST02561   1623
1421 EST02457   1521 EST02562   1625
1422 EST02458   1522 EST02563   1626
1423 EST02459   1523 EST02564   1628
1424 EST02460   1524 EST02565   1632
1425 EST02461   1525 EST02566   1633
1426 EST02462   1526 EST02567   1634
1428 EST02464   1529 EST02570   1635
1429 EST02465   1530 EST02571   1636
1431 EST02467   1532 EST02573   1638
1432 EST02468   1533 EST02574   1640
1433 EST02469   1534 EST02575   1641
1434 EST02470   1535 EST02576   1642
1435 EST02471   1538 EST02579   1643
1436 EST02472   1539 EST02580   1645
1437 EST02473   1540 EST02581   1646
1438 EST02474   1541 EST02582   1647
1440 EST02476   1542 EST02583   1648
1442 EST02478   1545 EST02587   1650
1443 EST02479   1546 EST02588   1652
1444 EST02480   1547 EST02589   1653
1445 EST02481   1549 EST02591   1655
1446 EST02482   1550 EST02592   1656
1447 EST02483   1552 EST02594   1657
1450 EST02486   1553 EST02595   1658
1452 EST02488   1554 EST02597   1659
1453 EST02489   1555 EST02598   1660
1454 EST02490   1556 EST02599   1661
1455 EST02491   1557 EST02600   1662
1456 EST02492   1559 EST02602   1663
1458 EST02495   1560 EST02603   1664
1459 EST02496   1564 EST02607   1665
1460 EST02497   1565 EST02608   1666
1461 EST02498   1568 EST02611   1668
1462 EST02499   1569 EST02612   1669
1464 EST02501   1570 EST02613   1670
1466 EST02503   1571 EST02614   1671
1467 EST02504   1573 EST02616   1672
1469 EST02506   1574 EST02617   1673
1470 EST02507   1576 EST02619   1674
1471 EST02508   1577 EST02620   SEQ ID#
1472 EST02509   1578 EST02621
1474 EST02511   1579 EST02622   1675
1475 EST02512   1580 EST02623   1676
1476 EST02513   SEQ ID#  EST#   1678
1477 EST02514                   1679
1481 EST02518   1582 EST02626   1680
1482 EST02519   1583 EST02628   1684
SEQ ID#  EST#   1584 EST02629   1685
                1585 EST02630   1686
1483 EST02520   1587 EST02632   1687
1484 EST02521   1590 EST02635   1688





EST02637


1689 


EST00845


1799 


COlUwYoH 


EST02638 


1690 


EST00846 


1800 


EST00935 

CO 1 


EST02639 


1691 


EST01577 


1801 


EST0093A 

CO 1 VV7JO 


EST02641 


1696 


EST00851 


1802 


EST00937

COl WW7J/ 


EST02642 


1697 


EST00852

CO 1 UVUJC 


1803 

IWWO 


CO 1 V IO %m> 


EST02643 


1702 


EST00854

CO 1 WWOJt 


IOUO 


FSTOnOAO 
COl UU7HU 


EST02645 


1703 


EST00A55 


1808 


COl UU7HC 


EST02646 


1705 


EST00856 

CO 1 VWBVV 


1810 


CO 1 UU7*t*f 


EST02648 


1707 


EST01581

BO 1 V 4^0 1 


1812 


FCT07A07 

CO 1 uXOTJ 


EST02649 


1709 


EST00859 


1813 


ESTfiOOAA 


EST02650 


1711 


EST00861 


1814 


EST009A7 

CO 1 ww7*f 


EST02651 


1712 


EST00862 


1815 


EST01&15 

CO 1 V IO Iw 


EST02652 


1713 


EST00863 


1816 


EST009AA 

COl UV7HO 


EST02654 


1714 


EST00864

CO 1 vwovt 


1817 


FQTflOOAO 
CO J WU74Y 


EST02656 


1717 


FST008AA 

CO 1 VWWWW 


IO IO 


ColUlOlO 


EST02657 


1719 


FSTGOAAA 
CO 1 wow 


1871 

IOC 1 


FCTftnOC9 


EST02658 


1720 


FSTOOftAO 
co i vuooy 


187? 

1 Out. 


Co 1 UU790 


EST02659 


1721 


EST00870 

GO 1 VWOf W 


1823 


CO 1 UU79*I 


EST02660 


1722 


COIWWOf 1 


187 A 


ColUlOlf 


EST02662 


1723 


EST00872

CO 1 WWwf c 


1875 

IOCS 


COIUV739 


EST02663 


1724 


EST00873


1827 


F5T01A18 

COl V ID IO 


EST02665 


1725 


EST00874 


1828 


EST00957 

CO 1 VVwJt 


EST02666 


1727 


EST00875 

h«> | WWWf ^ 


1831 


FST01A.19 

CO ■ U ID 17 


EST02668 


1728 


EST00876

EO 1 VWfU 


1832 


CO 1 VU70V 


EST02669 


1732 


F ST 01 590 

CO 1 V IJTV 


1833 

IOJJ 


Co t UUVO 1 


EST02672 


1733 


FST01591 

CO IVIrfT I 


1835 


Co 1 UUTOC 


EST02673 


173£ 


CO 1 UUOOv 


lOOO 


CCTA1A99 
ColUlOCC 


EST02A75 


1735 
1 »09 


BolUUoOl- 


1ITT7 


CoiUwYO) 


EST07A70 

COIUCOf 7 


1 f JO 




1030 


cSTQU7o4 


EST07AAA 

C9 1 VCOOw 


If Of 


CoIUwoOC 


IOOT 


ESTUvV09 


FCT07AA1 

CO 1 VCOO 1 


IfHV 


CSTUcOOr 


low 


EST0W66 


FST02AA? 


17A1 


col Uvooo 


IOH 1 


cSTUUTwr 


FST02AAA 

CO 1 WfOO*l 


17AA 


ceTflflARO 
CoIvvOOt 




CSTwUTOO 


FSTQ079A 

CO 1 UUfTO 


17A5 
1 rH9 


CCTOAAQn 


1fiAT 


ECTAAA<A 


FSTOflAAA 
CO 1 wwouu 




CCTAAA09 

, to | Wore 


low 


cST00y7u 


EST00801 


• 1748 

1 f "tO 


CO 1 WOrJ 


1AAA 


coiuurrc 


EST00802

b«l 1 WWWVfc 


17AO 


FCTA150X 
CO 1 V 1 3TO 


103U 


CO IV IOC** 


EST00803 

Wwl WWW 


1750 
1 r Jw 


CO I WWvW 


1A<£1 
1D71 


CoIUUrrO 


PCTQQftQA 

CO 1 WwWH 


1 1 JC 


FQTnflAOA 
COlUUOTO 


1ACA 


cSTUUrro 


ESTQ08A5 

KO 1 WWUW^ 


1755 


coi uwoyf 


. 1039 


CCTAA070 

coiuuy/y 


ESTOQfiOA 


1754 


COl wwOTw 


1A^7 
lOv/ 


CCTAAOftA ■ 


EST00807 


1755 


F. ST 00 899 

C9IUUOT7 


1858 


Co i uuyo i 


EST00A09 

fcO 1 WWWWT 


1757 


COlU 137*9 


1097 


CCTAAOQ9 

cSTiniyoc 


EST0081 1 ' 


1758 


CO 1 VV7U | 


1AA1 
lOOl 


CCTAAOftA 

co i uuyo*» 


EST00812

bO 1 WWW 1 C 


1750 


CO 1 UwTUC 


1AA9 


CCTAAOAC 
ColUUTO? 


EST00813


1761. 


CO 1 U I370 


lOOo 


CCTAAOftA 
ColUi/TOO 


EST00814


1762 


FSTOOOflA 


1AAA 


CCTAA0A7 
Co 1 UUVOf 


EST00815

k W I WWW 1 w 


1763 

1 Two 


ceTnnonc 

Co 1 WU7U3 


1AA5 


CCTAAOAA 

coiuuyoo 


EST00816

fa V 1 WWW 1 W 


17A5 


Co 1 U lOUU 


1867 


EST00990 


EST00817

CO 1 WVU 1 f 


17AA 
1 rOQ 


Co 1 UU7UO 


1AAA 

looo . 


CCTAAOOI 

cSTOuyyi 


EST00818 

CO 1 WWB IO 


1772 

If f£ 


Co 1 UC07 1 


1A7n 
10/ U 


CCTAAOOI 

csiuuyyo 


EST0Q819 

Hi WWW 1 W 


1773 


Co 1 W7 1 1 


1A79 
lOrc 


CCTAAOCK 

to i uuyy? 


EST 00820 

Cw | WWWCU 


177A- 


F<sTnnoi2 
co i uuy \c 


lOr J 


CCTA1A1A 
CoTUlaoU 


EST00821 


1775 


Co 1 WCw7C 


1A7A 


CCTAAOOA 

coi uuyyo 


EST0Q822 


177A 


Co i u i ouj 


10/9 


CCTA1 ATI 


EST0Q823 


1778 


Co 1 UU7 IH 


1A7A 
lOrO 


CCTAA007 

coTooyyf 


ESTQ082& 


1770 

iffy 


Co 1 UU7 1 j 


Ceo m4i 




EST00826 


1781 


F ST 009 17 
co i vuy i r 


EST 00827 


1782 


CO 1 UVJ7 IO 


1A7A 
lOrO 


CCTAAOOO 

csiuuyyy 


e ST00828 


1783 


FQT00O10 
CO 1 UU7 17 


lOf 7 


CCTA4 AXt 
COl UlOJJ 


EST 00829 


SEQ ID* 


CD 1 rr 


1AA1 
lOOl 


CCTA4 AAA 
CoIUlUUU 


EST 00830 






1AA9 
lOOc 


CoIUlOOO 


ESTQ0831 

CO 1 VVOJ 1 


17RA 
1 f Of 


CO 1 UU7£U 


1DD7 

100O 


CCTA1 AAf 


EST 00832 


17H5 


CO 1 UU7C 1 


1AAA 

loot 


CCTA1AA9 

CoTUlUUc 


EST* 


1 TOO 


CO 1 UU7££ 


looo 






1787 

t f Of 


F*T0n02T 

COl 


1AA7 
lOOr 


ccTninn/ 
calUluuH 


EST00833 

bO 1 VUOJJ 


1788 

1 f OO 


Col UU7CN 


lOOr 


coTUlUUo 


EST00834   1789 EST00925   1891 EST01008
EST00836   1790 EST00926   1893 EST01642
EST00837   1791 EST00927   1895 EST01010
EST00838   1792 EST00928   1898 EST01013
EST00841   1794 EST01607   1899 EST01014
EST00842   1795 EST00930   1901 EST01016
EST01574   1796 EST00931   1902 EST01017
EST00843   1797 EST00932   1905 EST01020
EST00844   1798 EST00933   1906 EST01021

1907 EST01022

1908 EST01023 

1909 EST01024 

1911 EST02694 

1912 EST01025

1913 EST01646 

1915 EST01027 

1916 EST01028 

1917 EST01029 

1918 EST02695 

1919 EST01030

1920 EST01031 

1921 EST01647 

1922 EST01032 

1923 EST01033 

1924 EST01034 

1925 EST01035 

1926 EST01036 

1927 EST01037 
1929 EST01039 
1932 EST01042 

1934 EST01043 

1935 EST01044 

1936 EST01045 

1937 EST01652 

1938 EST01654 

1941 EST01047 

1942 EST01048 

1943 EST01049 

1945 EST01051 

1946 EST02696 

1947 EST01052 

1948 EST01053 

1950 EST01055 

1951 EST01056 

1952 EST01057 
1955 EST01662 

1957 EST01059

1958 EST01060 

1959 EST01061 

1963 EST01063 

1964 EST01064 
1966 EST01065 

1968 EST01067 

1969 EST01068 

1970 EST01666 

1971 EST01069 

1972 EST01070 
1975 EST01073

1976 EST01074 

1978 EST01076 

1979 EST01077 
SEQ ID# EST#

1980 EST01078

1981 EST01079 

1983 EST01081 

1984 EST01082 

1985 EST01083 

1986 EST01084 

1988 EST01085 

1989 EST01086 

1995 EST01092 

1996 EST01093 

1998 EST01095 

1999 EST01096 

2002 EST01099 

2003 EST01675 

2005 EST01100 

2006 EST01101 
2007 EST01102

2009 EST01677 

2010 EST01104 

2011 EST01105 

2014 EST01108 

2015 EST01109 



2016 EST01110 

2018 EST01111 

2019 EST01112 

2020 EST01113 

2021 EST01114 

2022 EST01115 

2023 EST01116 

2025 EST01118 

2026 EST01119 

2027 EST01120 

2028 EST01121 

2029 EST01682 

2030 EST01122 
2033 EST01684

2034 EST01124 

2035 EST01125 

2036 EST01126 

2037 EST01686 

2038 EST01127 

2039 EST01128 

2040 EST01129 
2042 EST01688 
2045 EST01133 

2047 EST01135 

2048 EST01136 

2049 EST01689 

2050 EST01137 

2052 EST01139 

2053 EST01140 

2054 EST01141 

2055 EST01690 
2057 EST01143 

2061 EST01147

2062 EST02701 

2063 EST01148 

2065 EST01691 

2066 EST01692 

2067 EST01693 

2069 EST01150 

2070 EST01151 
2072 EST01152 

2074 EST01698 

2075 EST01153 

2076 EST02702 

2077 EST01154 

2078 EST01155 

2079 EST01156 

2080 EST01157 

2081 EST01158 

2082 EST01159 

2083 EST01160 

2084 EST01161 

2085 EST01162 

2086 EST01163 

2087 EST01164 

2088 EST01166 
2091 EST01168 
2093 EST01170 

2095 EST01701

2096 EST01172

2097 EST01173 

2098 EST01174 

2099 EST01175 

2103 EST01179

2104 EST01180 

2107 EST01183 

2108 EST01184 

2109 EST01185 

2110 EST01186 

2111 EST01187 

2112 EST01188 

2113 EST01189 

2114 EST01190 

2115 EST01191 



2118 EST01194 

2119 EST01195 

2122 EST01197 

2123 EST01713 

2124 EST01198 

2125 EST01199 

2126 EST01200 

2127 EST01201 

2129 EST01203 

2130 EST01204 

2132 EST01206 

2133 EST01207 
2135 EST01209 
2137 EST01211 

2139 EST01716 

2140 EST01212 

2142 EST01214 

2143 EST01215 

2147 EST01219 

2148 EST01220 

2151 EST01223 

2152 EST01224 
2154 EST01226 

2156 EST01718 

2157 EST01719 

2158 EST01228 

2159 EST01229 

2160 EST01230 

2162 EST01232 

2163 EST01233 

2164 EST01234 

2165 EST01720 

2166 EST01236 

2167 EST01237 

2169 EST01722 

2170 EST01239 

2171 EST01240

2172 EST01241 
2175 EST01243 

2177 EST01245 

2178 EST01726 

2179 EST01246 

2180 EST01247 

2181 EST01248 
SEQ ID# EST#

2182 EST01249 

2183 EST01250 

2185 EST01252 

2186 EST01253 

2187 EST01727 

2188 EST01254 

2190 EST01728 

2191 EST01256 

2193 EST01258 

2194 EST01729 

2195 EST01259 

2197 EST01261 

2198 EST01730 

2199 EST01262 

2200 EST01731 

2201 EST01263 

2202 EST01732 

2205 EST01735 

2206 EST01736 

2208 EST01267 

2209 EST02717 

2210 EST01268 

2211 EST01269 
2213 EST01271 
2215 EST01273 

2218 EST01274 

2219 EST01275 

2220 EST01740 

2221 EST01741 

2222 EST01276 



2223 EST01742 

2224 EST01277 
2226 EST01280 
2229 EST01281 
2231 EST01746 

2237 EST01288 

2238 EST01289 

2239 EST01290 

2240 EST01291 

2241 EST01747 

2242 EST01292 

2243 EST01293 

2244 EST01294 

2246 EST01295 

2247 EST01296 
2249 EST01298 

2251 EST01300

2252 EST01750 

2253 EST01301 

2256 EST02718 

2257 EST01303

2258 EST01754 

2260 EST01305 

2261 EST01755 

2262 EST01306 

2264 EST01308 

2265 EST01309 

2268 EST01311 

2269 EST01312 

2270 EST01313 

2271 EST01314 

2272 EST01762 

2273 EST01315 

2275 EST01316 

2276 EST01317

2277 EST01318 

2278 EST01319 

2279 EST01320 

2280 EST01763 

2284 EST01323 
SEQ ID# EST#

2285 EST01768 

2287 EST01770 

2288 EST01324 

2290 EST01772 

2291 EST01773 

2292 EST01326

2293 EST01327 

2294 EST01328 

2295 EST01329 

2296 EST01330 

2298 EST01331 

2299 EST01332 
2301 EST01334 

2304 EST01780

2305 EST01336 

2306 EST01337 

2310 EST01341 

2311 EST01342 

2312 EST01343 

2313 EST01344 

2315 EST01346 

2316 EST01782 

2317 EST01347 

2318 EST01348 

2319 EST01349 

2321 EST01350 

2322 EST01351 
2323 EST01789
2325 EST01353 

2327 EST01354 

2328 EST01355 

2329 EST01792 

2330 EST01793 

2331 EST01356 



2332 EST01794 

2333 EST01357 

2335 EST01359 

2336 EST01360

2337 EST01361 

2340 EST01802 

2341 EST01364 

2343 EST01366 

2344 EST01367 

2349 EST01372 

2350 EST02708 
2352 EST01374 

2356 EST01377

2357 EST01378

2360 EST01381 

2361 EST01382 

2363 EST01384 

2364 EST01385 

2365 EST01386 

2366 EST01387 

2369 EST01811 

2370 EST01390 

2371 EST01391 

2372 EST01392 

2375 EST01815 

2376 EST01395 

2377 EST01396 

2379 EST01398 

2380 EST01399 

2381 EST01400 

2382 EST01401 

2383 EST01402 

2384 EST01403 

2385 EST01816 

2386 EST01404 

2387 EST01405
SEQ ID# EST#

2389 EST01407 

2391 EST01415

2392 EST01416 

2395 EST01419

2397 EST01421

2401 EST01424 

2403 EST01425 

2404 EST01426 
2406 EST02713 
2409 EST00273 






WO 93/16178 PCT/US93/01294

EXAMPLE 10 

Functional Groupings of ESTs and Corresponding Genes

By matching new human ESTs to known sequences from other
species, the apparent function of the gene corresponding to
the EST can be ascertained. The data generated in Examples 3
and 4 have been used to categorize 127 of the ESTs of the
present invention, and their corresponding genes, into
predicted functional groups. (These 127 are ESTs with
database matches to sequences from other species for which a
function was known.) Two different grouping schemes have
been used.

The first scheme separates the sequences into three
broad categories: metabolic, regulatory, and structural.
These groupings are set out in Table 10.
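
The three-class scheme can be illustrated with a short Python sketch. It is an illustration only: the record layout and the names GROUP_NAMES, SAMPLE_ROWS and tally_by_group are assumptions introduced here, and the sample rows are a small subset copied from Table 10 rather than the full table.

```python
from collections import Counter

# Group codes of the three-class scheme, as given in the Table 10 group key.
GROUP_NAMES = {"M": "Metabolic", "R": "Regulatory", "S": "Structural"}

# A few (SEQ ID, EST#, group, putative identification) rows copied from Table 10.
# This is a hypothetical subset chosen for illustration, not the complete table.
SAMPLE_ROWS = [
    (97, "EST00289", "M", "Aconitase"),
    (691, "EST00675", "M", "Alcohol dehydrogenase"),
    (1910, "EST01645", "R", "Calmodulin"),
    (2353, "EST01806", "R", "Prohibitin"),
    (2004, "EST01676", "S", "Cofilin"),
    (1371, "EST02402", "S", "Talin"),
    (77, "EST00257", "S", "Kinesin"),
]


def tally_by_group(rows):
    """Count rows per functional group, keyed by the full group name."""
    counts = Counter(GROUP_NAMES[group] for _, _, group, _ in rows)
    return dict(counts)


if __name__ == "__main__":
    for name, count in tally_by_group(SAMPLE_ROWS).items():
        print(f"{name}: {count}")
```

Applied to every row of Table 10 instead of the subset, the same tally would give the number of ESTs placed in each broad category.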

The second grouping scheme separates the sequences into
13 specific categories: cell surface proteins; developmental
control; energy metabolism; kinases and phosphatases;
oncogenes; other metabolism-related polypeptides; peptidases
and peptidase inhibitors; receptors; structural and
cytoskeletal; signal transduction; transporters;
transcription, translation, and subcellular localization; and
transcription factors. These groupings are set out in Table
11.

Table 10: Three-Class Functional Groupings of ESTs



SEQ ID   EST#       Group   Putative Identification
1834     EST01620   M       AMP deaminase, brain
97       EST00289   M       Aconitase
691      EST00675   M       Alcohol dehydrogenase
2092     EST01700   M       Anion exchanger homolog AE3
396      EST01443   M       CDPdiacylglycerol-serine O-phosphatidyltransferase
1956     EST01663   M       Ca2+-transporting ATPase 2
1039     EST02055   M       Calcium channel
2192     EST01257   M       Diacylglycerol kinase, lymphocyte
1441     EST02477   M       Diamine acetyltransferase
2289     EST01325   M       Fatty acid synthase
310      EST00377   M       Fo ATPase beta subunit, mitochondrial
1667     EST00825   M       Gamma-aminobutyric acid transporter
1412     EST02446   M       Glutamate-aspartate carrier protein
1020     EST02034   M       Glutaminase
2326     EST01791   M       Inositol-1,4,5-trisphosphate 3-kinase
2173     EST01724   M       Lon protease
1427     EST02463   M       Long-chain-fatty-acid-CoA ligase
2226     EST01744   M       NAD(P)+ transhydrogenase (B-specific)
1566     EST02609   M       Neutrophil oxidase factor
1681     EST01573   M       Nucleoside diphosphate kinase
2254     EST01751   M       Phosphatidylinositol-4,5-bisphosphate phosphodiest
93       EST00287   M       Processing enhancing protein
2297     EST01775   M       Prohormone cleavage enzyme
9        EST00376   M       Prolyl endopeptidase
1654     EST01572   M       Protochlorophyllide reductase
38       EST00374   M       RNA polymerase II 6th subunit (RP026)
1715     EST01583   M       Ribosomal protein L18a
1856     EST01627   M       Ribosomal protein L1a
1974     EST01667   M       Ribosomal protein L3
301      EST00300   M       Ribosomal protein L30
22       EST00301   M       Ribosomal protein S10
2402     EST01826   M       Ribosomal protein S10
463      EST01459   M       Ribosomal protein YL10
2073     EST01697   M       Succinate dehydrogenase flavoprotein
2138     EST01715   M       Succinate dehydrogenase flavoprotein
1771     EST01601   M       Thiosulfate sulfurtransferase (rhodanese)
2121     EST01711   M       Valine-tRNA ligase
1726     EST01588   M       XPR2 alkaline extracellular protease
913      EST01913   M       Clathrin coat assembly protein AP50 homolog
1035     EST02051   M       J1 protein
969      EST01982   R       ADP-ribosylation factor 1
1126     EST02146   R       Calbindin D28
1910     EST01645   R       Calmodulin
485      EST01466   R       Calmodulin-dependent protein kinase, type II, beta
2302     EST01779   R       Discs-large tumor suppressor
188      EST00256   R       Enhancer of split
1229     EST02258   R       KUP protein
993      EST02007   R       Kinase 5 protein
2282     EST01764   R       Lamin B receptor
161      EST00247   R       MARCKS (myristoylated alanine-rich protein kinase
769      EST00734   R       MARCKS homolog
1386     EST02418   R       MARCKS homolog
227      EST00259   R       Notch/Xotch
952      EST01961   R       Notch/Xotch
1395     EST02429   R       Nuclear factor 1-like protein (NF1)
2353     EST01806   R       Prohibitin
1069     EST02087   R       Protein kinase C, zeta
1933     EST01650   R       Protein phosphatase 2A beta subunit
202      EST00298   R       Protein-tyrosine phosphatase LRP
1478     EST02515   R       Rab5
1408     EST02442   R       Seven in absentia
300      EST00232   R       Transforming protein (dbl)
1147     EST02169   R       Tyrosine kinase
1348     EST02378   R       cAMP-dependent protein kinase inhibitor
1931     EST01041   R       cAMP-regulated phosphoprotein
1413     EST02447   R       cAMP-specific phosphodiesterase
37       EST00038   R       ras p21-like small GTP-binding protein (smg GDS)
102      EST00248   R       rho H12/ARH12
299      EST00249   R       smg p25A GDP dissociation inhibitor
189      EST00282   R       trkB
1332     EST02362   R       GA binding protein, beta subunit
1277     EST02306   R       Bib protein
43       EST00371   R       Maternal G10 protein
1704     EST01580   R       Myeloid differentiation primary response gene MyD1
346      EST01828   R       Otd homeotic protein
187      EST00152   R       Wilm's tumor-related protein
249      EST00275   R       Zinc finger proteins
413      EST01446   R       Zinc finger proteins
469      EST01460   R       Zinc finger proteins
833      EST01560   R       Zinc finger proteins
1230     EST02259   R       Zinc finger proteins
1496     EST02534   R       Zinc finger proteins
2324     EST01352   R       Zinc finger proteins
208      EST00250   S       60K filarial antigen
2320     EST01784   S       60K filarial antigen
251      EST00370   S       Actin, other
2146     EST01218   S       Actin, other
248      EST00271   S       Actinin, alpha
891      EST01891   S       Actinin, alpha
1500     EST02538   S       Actinin, alpha
132      EST00110   S       Agrin
1852     EST01625   S       Agrin
1965     EST01664   S       Amyloid A4
2068     EST01694   S       Amyloid A4
2408     EST00244   S       Amyloid A4
1880     EST01634   S       Axonal glycoprotein TAG-1
2004     EST01676   S       Cofilin
650      EST00642   S       Dilute (myosin heavy chain)
2217     EST01738   S       Gelation factor ABP-280
1885     EST01639   S       Histocompatibility antigen modifier 1
77       EST00257   S       Kinesin
78       EST00258   S       Kinesin
2245     EST01748   S       Kinesin
313      EST00276   S       Lysosomal membrane glycoprotein 1 (LAMP-1)
223      EST00368   S       Microtubule-associated protein 1B
824      EST01865   S       Microtubule-associated protein 1B
2032     EST01683   S       Microtubule-associated protein 1B
2017     EST01678   S       Milk fat globule membrane protein
1567     EST02610   S       Neural cell adhesion molecule L1
506      EST01471   S       Neuraxin
2368     EST01389   S       Radial spoke protein 3
951      EST01960   S       Spectrin, beta
2089     EST01699   S       Sperm membrane protein
653      EST01512   S       Tubulin, alpha
311      EST00270   S       Tubulin, beta
594      EST01490   S       Tubulin, beta
757      EST01542   S       Tubulin, beta
1245     EST02274   S       Tubulin, beta
1589     EST02634   S       Tubulin, beta
1468     EST02505   S       Matrin 3
1371     EST02402   S       Talin
1701     EST00853   S       Unc-104

Group Key: M: Metabolic, R: Regulatory, S: Structural

Table 11: Thirteen-Class Functional Groupings of ESTs



SEOID 


EST/? 


Group 


Putative Identification 


208 


EST00250 


CS 


60K filarial antigen 


2320 


EST01784 


CS 


60K filarial antigen 


1965 


EST01664 


CS 


Amyloid A4 


2068 


EST01694 


CS 


Amyloid A4 


2408 


EST00244 


CS 


Amyloid A4 


1880 


EST01634 


CS 


Axonal glycoprotein TAG-1 


1885 


EST01639 


CS 


Histocompatibility antigen modifier 1 


313 


EST00276 


CS 


Lysosomal membrane glycoprotein 1 (LAMP-1) . 


2017 


EST01678 


CS 


Milk tat globule membrane protein 


1567 


ESTQ2610 


CS 


Neural cell adhesion molecule LI 


2368 


EST01389 


CS 


Radial spoke protein 3 


2089 


BST01699 


CS 


Sperm membrane protein 


1277 


EST02306 


DC 


Bib protein 

* 


188 


EST0O256 


DC 


Enhancer of split 


43 


EST00371 


DC 


Maternal G10 protein 


1704 


EST01580 


DC 


Myeloid differentiation primary response gene MyDl 


. 227 


EST00259 


DC 


Notch/Xotch 


952 


EST01961 


DC 


Notch/Xotch 


346 


EST01828 


DC 


Orthodentical homeotic protein 


1408 


EST02442 


DC 


Seven in absentia 


97 


EST00289 


EM 


Aconitase 


310 


EST00377 


EM 


Fo ATPase beta subunit, mitochondrial 


485 


EST01466 


KP 


CalmoduluMiependent protein kinase, type n, beta 


993 


EST02007 


KP 


Kinase S protein 


1069 


EST02087 


KP 


Protein kinase C, zeta 


1933 


EST01650 


KP 


Protein phosphatase 2A beta subunit 


202 


EST00298 


KP 


Protein-tyrosine phosphatase LRP 


1348 


EST02378 


KP 


cAMP-dependent protein kinase inhibitor 


2302 


EST01779 


OG 


Discs-large tumor suppressor 


2353 


EST01806 


OG 


Prohibitin 


1478 


BST02515 


OG 


Rab5 


300 


EST00232 


OG 


Transforming protein (dbl) 


37 


EST00038 


OG 


ras p21-like small GTP-binding protein (smg GDS) 


102 


EST00248 


OG 


rho H12/ ARH12 


1834 


EST01620 


OM 


AMP deaminase, brain 


691 


EST00675 


OM 


Alcohol dehydrogenase 


396 


EST01443 


OM 


CDPdiacylglycerol-serine O-phosphatidyltransferase 


2192 


EST01257 


OM 


Diacylglycerol kinase, lymphyocyte 


1441 


EST02477 


OM 


Diamine acetyltransferase 


2289 


EST01325 


OM 


Fatty acid synthase 


1020 


EST02034 


OM 


Glutaminase 


2326 


EST01791 


OM 


Inositol-l,4,5-trisphosphate 3-kinase 


1427 


EST02463 


OM 


Long-chain-fatty-acid-CoA ligase 


2226 


EST01744 


OM 


NAD(P)+ transhydrogenase (B-specific) 


1566 


EST02609 


OM 


Neutrophil oxidase factor 


1681 


EST01573 


OM 


Nucleoside diphosphate kinase 
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SEQID 


ESTft 


Group 


Putative Identification 


2254 


EST01751 


OM 


Phosphatidylinositol-4,5-bisphosphatephosphodiest 


16S4 


EST01572 


OM 


Protochlorophyllide reductase 


2073 


EST01697 


OM 


Succinate dehydrogenase flavoprotein 


2138 


EST01715 


OM 


Succinate dehydrogenase flavoprotein 


1771 


EST01601 


OM 


Thiosulfate sulfurtransferase (rhodanese) 


2173 


EST01724 


PI 


Lon protease 


2297 


EST01775 


PI 


Prohormone cleavage enzyme 


9 


EST00376 


PI 


Prolyl endopeptidase 


1726 


EST01588 


PI 


XPR2 alkaline extracellular protease 


1147 


EST02169 


PP 


Tyrosine kinase 


2282 


EST01764 


RT 


Lamin B receptor 


189 


EST00282 


RT 


trkB 


251 


EST00370 


sc 


Actin, other 


2146 


EST01218 


SC 


Actin, other 


248 


EST00271 


SC 


Actinin, alpha 


891 


EST01891 


SC 


Actinin, alpha 


1500 


EST02538 


SC 


Actinin, alpha 


132 


EST00110 


SC 


Agrin 


1852 


EST01625 


SC 


Agrin 


2004 


EST01676 


SC 


Cofilin 


650 


EST00642 


SC 


Dilute (myosin heavy chain) 


2217 


EST01738 


SC 


Gelation factor ABP-280 


77 


EST00257 


SC 


Kinesin 


78 


EST00258 


SC 


Kinesin 


2245    EST01748    SC    Kinesin
1468    EST02505    SC    Matrin3
223     EST00368    SC    Microtubule-associated protein 1B
824     EST01865    SC    Microtubule-associated protein 1B
2032    EST01683    SC    Microtubule-associated protein 1B
506     EST01471    SC    Neuraxin
951     EST01960    SC    Spectrin, beta
1371    EST02402    SC    Talin
653     EST01512    SC    Tubulin, alpha
311     EST00270    SC    Tubulin, beta
594     EST01490    SC    Tubulin, beta
757     EST01542    SC    Tubulin, beta
1245    EST02274    SC    Tubulin, beta
1589    EST02634    SC    Tubulin, beta
1701    EST00853    SC    Unc-104
969     EST01982    ST    ADP-ribosylation factor 1
1126    EST02146    ST    CalbindinD28
1910    EST01645    ST    Calmodulin
161     EST00247    ST    MARCKS (myristoylated alanine-rich protein kinase
769     EST00734    ST    MARCKS homolog
1386    EST02418    ST    MARCKS homolog
1931    EST01041    ST    cAMP-regulated phosphoprotein
1413    EST02447    ST    cAMP-specific phosphodiesterase
299     EST00249    ST    smg p25A GDP dissociation inhibitor
2092    EST01700    TP    Anion exchanger homolog AE3
1956    EST01663    TP    Ca2+-transporting ATPase 2
1039    EST02055    TP    Calcium channel
1667    EST00825    TP    Gamma-aminobutyric acid transporter
1412    EST02446    TP    Glutamate-aspartate carrier protein
913     EST01913    TT    Clathrin coat assembly protein AP50 homolog
1035    EST02051    TT    J1 protein
93      EST00287    TT    Processing enhancing protein
38      EST00374    TT    RNA polymerase II 6th subunit (RPO26)
1715    EST01583    TT    Ribosomal protein L18a
1856    EST01627    TT    Ribosomal protein L1a
1974    EST01667    TT    Ribosomal protein L3
301     EST00300    TT    Ribosomal protein L30
22      EST00301    TT    Ribosomal protein S10
2402    EST01826    TT    Ribosomal protein S10
463     EST01459    TT    Ribosomal protein YL10
2121    EST01711    TT    Valine-tRNA ligase
1332    EST02362    TX    GA binding protein, beta subunit
1229    EST02258    TX    KUP protein
1395    EST02429    TX    Nuclear factor 1-like protein (NF1)
187     EST00152    TX    Wilm's tumor-related protein
249     EST00275    TX    Zinc Finger Proteins
413     EST01446    TX    Zinc Finger Proteins
469     EST01460    TX    Zinc Finger Proteins
833     EST01560    TX    Zinc Finger Proteins
1230    EST02259    TX    Zinc finger proteins
1496    EST02534    TX    Zinc finger proteins
2324    EST01352    TX    Zinc Finger Proteins



Group Key: CS: Cell Surface, DC: Developmental Control, EM: Energy Metabolism, KP: Kinases
and Phosphatases, OG: Oncogenes, OM: Other Metabolism, PI: Peptidases and Peptidase Inhibitors,
RT: Receptors, SC: Structural and Cytoskeletal, ST: Signal Transduction, TP: Transporters, TT:
Transcription, Translation, and Subcellular Localization, TX: Transcription Factors.

EXAMPLE 11

cDNA Libraries Generated From Specific Genomic DNA
by Exon Expression & Amplification

Exon amplification was used to express potential exons
from genomic DNA in a recombinant vector that contains some
of the signals necessary for splicing. If an exon is present
in the proper orientation in the vector, that exon will be
spliced in a mammalian cell and will become part of the mRNA
of that cell. The exon splice-product can be purified from
other mRNA in the cell by conversion of the mRNA to cDNA and
selective amplification of the recombinant splice-product
cDNAs. Cosmid DNA from human chromosome 19q13.3 was digested
with BamHI or BamHI/BglII restriction enzymes, and the fragments
generated were collected and size-specifically cloned into an
expression vector (Buckler, et al., Proc. Nat'l. Acad. Sci.
USA, 88:4005-4009 (1991)). After transfection of these
constructs into COS cells by electroporation, RNA
transcripts were generated using the SV40 early promoter and
a polyadenylation signal derived from SV40, both present in
the expression vector. When a fragment of genomic DNA
contains an entire exon with flanking intron sequence in the
sense orientation, the exon should be retained in the mature
poly(A)+ cytoplasmic RNA. Therefore, the mRNA was used as
template for cDNA synthesis using reverse transcriptase and
vector-priming. Subsequently, the cDNAs were amplified by
vector-priming using PCR. A fraction of this first PCR
product was reamplified using internal vector-primers
containing terminal cloning sites. These products were
end-repaired with T4 DNA polymerase, digested with the
appropriate restriction enzymes, gel purified and cloned into
pBluescript vectors. The constructs were transfected into
XL1-Blue competent cells and plated on
LB/X-gal/IPTG/ampicillin plates. White colonies were selected and
expanded to prepare DNA templates as described in Example 2.
When multiple cosmids or YAC clones were used as the source
DNA, a pool of specific expressed exons was obtained as a
cDNA library. The EST/cDNAs sequenced from this specific
library are disclosed herein as SEQ ID NOS: 2412-2417.

EXAMPLE 12 

PCR Amplification from Predicted Exons

Computational analyses can be applied to genomic DNA 
sequences to predict protein coding regions. The coding 
region prediction program CRM (E. Uberbacher and R. Mural, 
Proc. Natl. Acad. Sci. USA 88:11261-5 (1991)) finds open 
reading frames and classifies them according to their 
probability of being coding regions. These regions are 
subsequently examined using the GM program (C. Fields and C. 
Soderlund, Comp. Applic. Biosci. 6:263, 1990), which
predicts intron-exon structure. PCR primers are then 
designed to amplify the predicted exons and used to test 
human cDNA libraries (for example, fetal brain or placental 
libraries) for the presence of these putative exons using a 
PCR assay. 

This strategy has been successfully applied in two
large-scale genomic sequencing projects, the Huntington's locus of
human chromosome 4p16.3 (McCombie, et al., submitted) and
human chromosome locus 19q13.3 (Martin-Gallardo, et al.,
submitted). Sequences from eleven predicted exons from
chromosome 4 were present in tested cDNA libraries,
indicating that this region has at least two and probably
three expressed genes. In one case, the method resulted in
an amplification product which spanned two predicted exons
(SEQ ID NO: 2411). When sequenced, this PCR product
indicated the presence of the two exons from which the
primers were initially chosen, as well as an intervening exon
which was also predicted by the CRM program, but not the
intervening genomic sequences. In a similar fashion, the
presence of the two predicted genes in the chromosome 19
sequence was confirmed by sequencing PCR products. SEQ ID NO:
2410 includes a partial exon of one of these genes.

EXAMPLE 13 

Complete Sequence of EST Clone Inserts 

There are a number of methods, known to those with skill
in the art of molecular biology, to obtain sequence
information from the cDNAs corresponding to the EST
sequences. Procedures for these methods are provided in
Basic Methods in Molecular Biology (Davis et al., supra). One
way to acquire more information about the cDNA from which an
EST was derived is to sequence the remainder of the cDNA
clone. The complete sequence of the inserts of four EST
clones (representing SEQ ID NOs 188, 189, 223, and 227) was
determined using Exonuclease III deletions. Briefly, EST
clones were digested with the restriction enzymes SalI and
KpnI or PstI and BamHI (for deletions from the Forward primer
and Reverse primer ends of the insert, respectively). The
KpnI and PstI enzymes leave 3' sticky ends following
digestion, which Exonuclease III is unable to bind. This
results in unidirectional deletions into the cDNA insert,
leaving the vector sequence undisturbed. After addition of
Exonuclease III to the Forward and Reverse deletion
reactions, aliquots of the reaction were removed at defined
time intervals and the reaction was stopped to prevent
further deletion. S1 nuclease and Klenow DNA polymerase were
added to create blunt-ended fragments suitable for ligation.

Samples for each time point were purified by
electrophoresis through an agarose gel and religated. Two to
four representative clones from each time point in each
direction were sequenced to give between 200 and 400 base
pairs of sequence data. Careful selection of deletion
conditions and time points allows a deletion series of
approximately 100-200 base pairs difference in length at each
consecutive time point. Sequence fragments were reassembled
into a redundant contiguous sequence using the INHERIT
software from Applied Biosystems, Inc. (Foster City, CA). In
this way, the complete insert from these four cDNA clones was
sequenced on both strands to an average redundancy between
three and four (each base was sequenced between three and
four times, on average). Those complete insert sequences are
disclosed herein as SEQ ID NOs 2418, 2419, 2420, and 2421,
corresponding to original ESTs with SEQ ID NOs 223, 189, 227,
and 188, respectively.
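
The average redundancy quoted above is the total number of
sequenced bases divided by the length of the insert. The
following is a minimal sketch of that arithmetic. The read
coordinates are hypothetical and are not data from the four
EST clones; the function names are chosen for illustration only.

```python
def average_redundancy(reads, insert_length):
    """Mean number of times each base of the insert was sequenced.

    reads: list of (start, end) positions on the insert,
    0-based, with end exclusive.
    """
    total_bases = sum(end - start for start, end in reads)
    return total_bases / insert_length


def uncovered_positions(reads, insert_length):
    """Positions of the insert that are not covered by any read."""
    depth = [0] * insert_length
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return [pos for pos, d in enumerate(depth) if d == 0]


# Hypothetical deletion series: reads of up to 300 bases that start
# every 100 bases along a 1000-base insert (one strand only).
reads = [(s, min(s + 300, 1000)) for s in range(0, 1000, 100)]
print(average_redundancy(reads, 1000))
print(uncovered_positions(reads, 1000))
```

Running the same calculation on the reads from the opposite
strand and summing the two results gives the overall redundancy.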

EXAMPLE 14 

Determining Reading Frame, Orientation, and Coding Regions of
ESTs and Complete cDNA Sequences

Once the complete cDNA sequence has been determined in 
accordance with Example 13, the reading frame, orientation, 
and coding regions are determined by computer techniques. 
(The complete coding region is considered to be the largest
open reading frame from a methionine to a stop codon.)

Specifically, the CRM program on the GRAIL server is 
used as explained in Example 9 to determine probable coding 
regions. This information is supplemented by location of 
start and stop codons. Where possible, the results of the 
CRM analysis are validated by comparison of the cDNA sequence 
to known sequences using database matching, in accordance 
with Examples 3 and 4. If a match of 50% (or even less) is
found in any particular reading frame and orientation, this 
serves to verify corresponding CRM results. Alternatively, 
database matches can be used to determine reading frame and 
orientation without use of the CRM program. Of course, if 
the cDNA is derived from a directional library, the probable 
orientation is already known. 
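
The working definition given above (the largest open reading
frame from a methionine to a stop codon) can be sketched as a
scan of all six reading frames. This is a plain ORF search
offered for illustration. It is not the CRM program, which
scores coding probability, and it assumes the standard stop
codons TAA, TAG and TGA.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_orf(seq):
    """Return (strand, frame, start, end, orf) for the longest ORF.

    An ORF here runs from the first ATG in a frame to the next
    in-frame stop codon, stop included. start and end index the
    strand that was scanned (end is exclusive). Returns None if
    no complete ORF is found.
    """
    best = None
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    orf = s[start:i + 3]
                    if best is None or len(orf) > len(best[4]):
                        best = (strand, frame, start, i + 3, orf)
                    start = None
    return best


# A toy sequence with one ORF in frame 2 of the forward strand.
print(longest_orf("CC" + "ATGAAATTTGGGTAA" + "CC"))
```

The strand and frame of the returned ORF give the orientation
and reading frame, which can then be checked against database
matches as described above.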

EXAMPLE 15 

Preparation of PCR Primers and Amplification of DNA

The EST sequences and the corresponding cDNA sequences
and genomic sequences may be used, in accordance with the
present invention, to prepare PCR primers for a variety of
applications. The PCR primers are preferably at least 15
bases, and more preferably at least 18 bases in length. The
procedure of Example 5 is repeated using the desired EST, or
using the corresponding cDNA or genomic DNA sequence from
Example 13. It is preferred that the primer pairs have
approximately the same G/C ratio, so that melting
temperatures are approximately the same. When screening
cDNA, introns are of no concern; however, when screening
genomic DNA, primers should be selected to avoid reading
across introns, which usually are too large to amplify. The
PCR primers and amplified DNA of this Example find use in the
Examples that follow.
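
The length and G/C preferences stated above can be checked
with a short calculation. The sketch below uses the common
rule of thumb that a short oligonucleotide melts at about
2 degrees C per A or T plus 4 degrees C per G or C. That rule,
the 4-degree tolerance and the function names are assumptions
made for illustration and are not part of the procedure of
Example 5.

```python
def gc_fraction(primer):
    """Fraction of bases in the primer that are G or C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rule-of-thumb melting temperature, degrees C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def primers_compatible(forward, reverse, min_length=15, max_tm_gap=4):
    """True if both primers meet the minimum length and their
    estimated melting temperatures differ by at most max_tm_gap."""
    if len(forward) < min_length or len(reverse) < min_length:
        return False
    return abs(wallace_tm(forward) - wallace_tm(reverse)) <= max_tm_gap


# Two hypothetical 18-base primers with the same G/C content.
print(primers_compatible("ATGCATGCATGCATGCAT", "TACGTACGTACGTACGTA"))
```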

EXAMPLE 16

Forensic Matching by DNA Sequencing

In one exemplary method, DNA samples are isolated from
forensic specimens of, for example, hair, semen, blood or
skin cells by conventional methods. A panel of PCR primers
derived from a number of the sequences of Examples 1, 2, 11,
12 and/or 13 is then utilized in accordance with Example 12
to obtain DNA of approximately 100-200 bases in length from
the forensic specimen. Corresponding sequences are obtained
from a suspect. Each of these identification DNAs is then
sequenced, and a simple database comparison determines the
differences, if any, between the sequences from the suspect
and those from the sample. Statistically significant
differences between the suspect's DNA sequences and those
from the sample conclusively prove a lack of identity. This
lack of identity can be proven, for example, with only one
sequence. Identity, on the other hand, should be
demonstrated with a large number of sequences, all matching.
Preferably, a minimum of 50 statistically identical sequences
of 100 bases in length are used to prove identity between the
suspect and the sample.
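
The comparison step can be sketched as a position-by-position
check of each locus in two panels. The locus names and
sequences below are hypothetical, and the sketch ignores
sequencing error, so it does not address the requirement that
differences be statistically significant.

```python
def mismatches(seq_a, seq_b):
    """Positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    pairs = zip(seq_a.upper(), seq_b.upper())
    return [i for i, (a, b) in enumerate(pairs) if a != b]


def compare_panels(specimen, suspect):
    """Compare two panels that map locus name -> sequence.

    Returns locus -> list of mismatch positions for every locus
    present in both panels.
    """
    shared = sorted(set(specimen) & set(suspect))
    return {locus: mismatches(specimen[locus], suspect[locus])
            for locus in shared}


def excluded(report):
    """A single differing locus is enough to exclude the suspect."""
    return any(report.values())


# Hypothetical two-locus panel.
specimen = {"locus1": "ACGTACGT", "locus2": "TTGGCCAA"}
suspect = {"locus1": "ACGTACGT", "locus2": "TTGACCAA"}
report = compare_panels(specimen, suspect)
print(report, excluded(report))
```

The asymmetry in the text is visible here: one mismatching
locus excludes a suspect, whereas a claim of identity rests on
every locus in a large panel matching.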

EXAMPLE 17 

Positive Identification by DNA Sequencing

The technique outlined in the previous example may also
be used on a larger scale to provide a unique
fingerprint-type identification of any individual. In this technique,
primers are prepared from a large number of sequences from
Examples 1, 2, 11, 12 and/or 13. Preferably, 20 to 50
different primers are used. These primers are used to obtain
a corresponding number of PCR-generated DNA segments from the
individual in question in accordance with Example 15. Each
of these DNA segments is sequenced, using the methods set
forth in Example 1. The database of sequences generated
through this procedure uniquely identifies the individual
from whom the sequences were obtained. The same panel of
primers may then be used at any later time to absolutely
correlate tissue or other biological specimen with that
individual.

EXAMPLE 18 
Southern Blot Forensic Identification

The procedure of Example 17 is repeated to obtain a
panel of from 10 to 2000 amplified sequences from an
individual and a specimen. This PCR-generated DNA is then
digested with one or a combination of, preferably, four-base
specific restriction enzymes. Such enzymes are commercially
available and known to those of skill in the art. After
digestion, the resultant gene fragments are size separated in
multiple duplicate wells on an agarose gel and transferred to
nitrocellulose using Southern blotting techniques well known
to those with skill in the art. For a review of Southern
blotting see Davis et al. (Basic Methods in Molecular
Biology, 1986, Elsevier Press, pp 62-65).

WO 93/16178    PCT/US93/01294

A panel of ESTs or complete cDNA sequences from Examples
1, 2, and/or 13, or fragments thereof of at least 15 bases,
are radioactively or colorimetrically labeled using
end-labeled oligonucleotides derived from the ESTs, nick
translated sequences or the like using methods known in the
art and hybridized to the Southern blot using techniques
known in the art (Davis et al., supra). Preferably, at least
5 to 10 of these labeled probes are used, and more preferably
at least about 20 or 30 are used to provide a unique pattern.
The resultant bands appearing from the hybridization of a
large sample of ESTs will be a unique identifier. Since the
restriction enzyme cleavage will be different for every
individual, the band pattern on the Southern blot will also
be unique. Increasing the number of EST probes will provide
a statistically higher level of confidence in the
identification since there will be an increased number of
sets of bands used for identification.
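
The band pattern depends on where the restriction sites fall
in each individual's DNA: a single base change that creates or
destroys a site changes the fragment lengths. The following
sketch illustrates this with two four-base cutters, AluI
(AG^CT) and HaeIII (GG^CC). The choice of enzymes and the test
sequences are assumptions made for illustration only.

```python
import re

# Recognition site and cut offset, counted in bases from the
# start of the site on the top strand.
ENZYMES = {
    "AluI": ("AGCT", 2),    # AG^CT
    "HaeIII": ("GGCC", 2),  # GG^CC
}


def cut_positions(seq, enzyme):
    """Cut positions for one enzyme on a linear sequence."""
    site, offset = ENZYMES[enzyme]
    # A lookahead is used so that overlapping sites are all found.
    return [m.start() + offset
            for m in re.finditer(f"(?={site})", seq.upper())]


def fragment_lengths(seq, enzymes):
    """Fragment lengths, in order along the sequence, after a
    complete digest with all of the named enzymes."""
    cuts = sorted({p for e in enzymes for p in cut_positions(seq, e)})
    edges = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(edges, edges[1:])]


# Two hypothetical alleles that differ by one base in the AluI site.
allele_a = "AAAAA" + "GCT" + "TTTTT" + "GGCC" + "AAAA"
allele_b = allele_a.replace("AGCT", "AGTT")
print(fragment_lengths(allele_a, ["AluI", "HaeIII"]))
print(fragment_lengths(allele_b, ["AluI", "HaeIII"]))
```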

EXAMPLE 19
Dot Blot Identification Procedure

Another technique for identifying individuals using the
sequences disclosed herein utilizes a dot blot hybridization
technique.

Genomic DNA is isolated from nuclei of the subject to be
identified. Oligonucleotide probes of approximately 30 bp in
length are synthesized that correspond to sequences from the
ESTs. The probes are used to hybridize to the genomic DNA
through conditions known to those in the art. The
oligonucleotides are end-labeled with 32P using
polynucleotide kinase (Pharmacia). Dot blots are created by
spotting about 50 ng cDNA of at least 10, preferably at least
50, sequences corresponding to a variety of the Sequence ID
NOs provided in Table 7 onto nitrocellulose or the like using
a vacuum dot blot manifold (BioRad, Richmond, California).
The EST clone sequences are baked or UV-linked to the
nitrocellulose filter, which is then prehybridized and
hybridized with labeled probe using techniques known in the
art (Davis et al., supra). The 32P-labeled DNA fragments are
sequentially hybridized under successively more stringent
conditions to detect minimal differences between the 30 bp
sequence and the DNA. Tetramethylammonium chloride is useful
for identifying clones containing small numbers of nucleotide
mismatches (Wood et al., Proc. Natl. Acad. Sci. USA
82(6):1585-1588 (1985), which is hereby incorporated by
reference). A unique pattern of dots distinguishes one
individual from another.

EXAMPLE 20

Alternative "Fingerprint" Identification Technique

EST sequences and the corresponding complete cDNA
sequences can be used to create a unique fingerprint for an
individual. Thus pools of EST sequences can be used in
forensics, paternity suits or the like to differentiate one
individual from another.

Entire EST sequences can be used; similarly,
oligonucleotides can be prepared from EST sequences. In this
example, 20-mer oligonucleotides are prepared from 200 EST
sequences using commercially available oligonucleotide
services such as Oligos Etc., Wilsonville, OR. Patient cell
samples are processed for DNA using techniques well known to
those with skill in the art. The nucleic acid is digested
with restriction enzymes EcoRI and XbaI. Following
digestion, samples are applied to wells for electrophoresis.
The procedure, as known in the art, may be modified to
accommodate polyacrylamide electrophoresis; however, in this
example, samples containing 5 µg of DNA are loaded into wells
and separated on 0.8% agarose gels. The gels are transferred
using Southern blotting techniques onto nitrocellulose.

10 ng of each of the oligos are pooled and end-labeled
with 32P. The nitrocellulose is prehybridized with blocking
solution and hybridized with the labeled probes. Following
hybridization and washing, the nitrocellulose filter is
exposed to X-Omat AR X-ray film. The resulting hybridization
pattern will be unique for each individual.

It is additionally contemplated within this example that
the representative number of EST sequences can be varied for
additional accuracy or clarity.

EXAMPLE 21 

Identification of genes associated with hereditary diseases

This example illustrates an approach useful for the
association of EST sequences with particular phenotypic
characteristics. In this example, a particular EST is used
as a test probe to associate that EST with a particular
phenotypic characteristic.

An EST clone corresponding to EST01643 (SEQ ID NO: 1894)
maps to a gene-rich region of chromosome 6. EST clone
HHCMH89, from which EST01643 was derived, was mapped to
chromosome 6p21 by Dr. Julie Korenberg of UCLA/Cedars-Sinai
Hospital using FISH. A search of Mendelian Inheritance in
Man (supra) revealed 6p21 to be a very gene-rich region
containing several known genes and several diseases for which
genes have not been identified. The cDNA encoded by EST
clone HHCMH89 thus becomes an immediate candidate for each of
these genetic diseases.

Cells from patients with these diseases are isolated and
expanded in culture. PCR primers from the EST sequences are
used to screen genomic DNA and RNA or cDNA from the patients.
ESTs that are not amplified in the patients can be positively
associated with a particular disease by further analysis.

EXAMPLE 22

Identification of a gene associated with
Angelman's disease

Angelman's disease (AD) is characterized by deletions
on the long arm of chromosome 15 (15q11q13) (Williams et al.,
Am. J. Med. Genet. 32:339-345 (1989), hereby incorporated by
reference). The symptoms of the disease include
developmental delay, seizures, inappropriate laughter and
ataxic movements. These symptoms suggest that the disorder
is a neurologic deficiency. This prophetic example
illustrates how ESTs, preferably obtained from a cDNA library
from human brain, may be used in identifying the defective
gene or genes associated with Angelman's disease. (The
example is based on analogous work with genomic DNA, rather
than cDNA and ESTs, in identifying the genetic defect
associated with Angelman's disease.) This example also
illustrates how EST sequences may generally be used for
identifying gene sequences associated with an inherited
disease that is mapped to a chromosome location.

ESTs are screened using techniques described in Example
5 and Example 7 to identify those ESTs that localize to the
long arm of chromosome 15 and preferably localize to
chromosome 15 bands 15q11q13 from normal patients. ESTs that
bind to the long arm of chromosome 15 are hybridized to
chromosome 15 from AD patients. These studies are
preferably performed using either fluorescence in situ
hybridization or somatic cell hybrids that contain
fragments from the long arm of chromosome 15 from AD
patients. Those chromosome 15-specific ESTs that do not map
to chromosome 15 from AD patients are useful as markers for
Angelman's disease and can be incorporated into diagnostics
for genetic screening. These ESTs are associated with
chromosome deletions present in Angelman's disease.
Identification of the gene associated with these AD-negative
ESTs and an analysis of the polypeptides encoded by the genes
from normal patients is essential for providing gene or other
therapies for AD patients.

Genetic diseases are not always accompanied by gene
deletions. Therefore, it is also important to use the ESTs
that bind to bands 15q11q13 from AD patients as tools to
identify the polymorphisms present within the disease
population. Restriction fragment length polymorphism (RFLP)
analysis can be performed on cells from AD patients or
from somatic cell hybrids created using the long arm of
chromosome 15. For a review of RFLP techniques see
Donis-Keller et al. (Cell 51:319-337 (1987), hereby incorporated by
reference). DNA is isolated from the somatic cell lines or
from cells from AD patients. The DNA is digested with one or
more restriction enzymes according to techniques of
Donis-Keller et al. The resulting fragments are separated by gel
electrophoresis, denatured, transferred to nitrocellulose and
hybridized with the selected radio-labeled ESTs that localize
to the region of interest. The autoradiographic pattern is
compared both to a number of AD patients and to normal
patients. Common patterns of EST hybridization in AD
patients that are not present in normal patients indicate
that the genes associated with these ESTs are candidate genes
affected by AD.

cDNA libraries are prepared from the somatic cell
hybrids from AD patients. Libraries are prepared using
Lambda Zap II Library Kits (Stratagene, La Jolla, California)
or other commercially available library kits. The ESTs of
interest are used as probes to identify those bacterial
colonies carrying genes corresponding to the EST probes.
Positive clones are sequenced and the sequences are compared
to homologous gene sequences derived from normal patients.

Alterations, including deletions and substitutions,
within gene sequences associated with bands 15q11q13 are
thus positively identified and associated with AD.
Wagstaff et al. were able to identify deletions and
substitutions in sequences encoding the GABA-A receptor
protein subunit from patients with Angelman's disease (Am. J.
Hum. Genet. 49:330-337 (1991)). It is likely that other
genes will additionally be associated with the disease.

EXAMPLE 23

Preparation and Use of Antisense Oligonucleotides

Antisense RNA molecules are known to be useful for
regulating translation within the cell. Antisense RNA
molecules can be produced from EST sequences or from the
corresponding gene sequences. These antisense molecules can
be used as diagnostic probes to determine whether or not a
particular gene is expressed in a cell. Similarly, the
antisense molecules can be used as a therapeutic to regulate
gene expression once the EST is associated with a particular
disease (see Example 22).

The antisense molecules are obtained from a nucleotide
sequence by reversing the orientation of the coding region
with regard to the promoter. Thus, the antisense RNA is
complementary to the corresponding mRNA. For a review of
antisense design see Green et al., Ann. Rev. Biochem.
55:569-597 (1986), which is hereby incorporated by reference. The
antisense sequences can contain modified sugar phosphate
backbones to increase stability and make them less sensitive
to RNase activity. Examples of the modifications are
described by Rossi et al., Pharmacol. Ther. 50(2):245-254
(1991).
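
Because the antisense RNA is complementary to the mRNA, its
sequence is the reverse complement of the sense strand, written
with U in place of T. A minimal sketch follows; the function
names and the six-base input are illustrative only.

```python
DNA_TO_RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")


def mrna(sense_dna):
    """mRNA sequence (5' to 3') copied from the sense DNA strand."""
    return sense_dna.upper().replace("T", "U")


def antisense_rna(sense_dna):
    """Antisense RNA (5' to 3') complementary to that mRNA."""
    return sense_dna.upper().translate(DNA_TO_RNA_COMPLEMENT)[::-1]


sense = "ATGGCC"
print(mrna(sense))           # AUGGCC
print(antisense_rna(sense))  # GGCCAU
```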

Antisense molecules are introduced into cells that
express the gene corresponding to the EST of interest in
culture. In a preferred application of this invention, the
polypeptide encoded by the gene is first identified, so that
the effectiveness of antisense inhibition on translation can
be monitored using techniques that include but are not
limited to antibody-mediated tests such as RIAs and ELISA,
functional assays, or radiolabelling. The antisense molecule
is introduced into the cells by diffusion or by transfection
procedures known in the art. The molecules are introduced
onto cell samples at a number of different concentrations,
preferably between 1x10^-10 M and 1x10^-4 M. Once the minimum
concentration that can adequately control translation is
identified, the optimized dose is translated into a dosage
suitable for use in vivo. For example, an inhibiting
concentration in culture of 1x10^-7 M translates into a dose of
approximately 0.6 mg/kg bodyweight. Levels of
oligonucleotide approaching 100 mg/kg bodyweight or higher
may be possible after testing the toxicity of the
oligonucleotide in laboratory animals.
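
The conversion from a concentration in culture to a bodyweight
dose is not spelled out above. One way to arrive at a figure of
the same order is to multiply the molar concentration by the
molecular weight of the oligonucleotide and to treat one
kilogram of bodyweight as roughly one liter of distribution
volume. Both the molecular weight used below (about 6,000
g/mol, roughly an 18-mer at 330 g/mol per nucleotide) and the
one liter per kilogram figure are assumptions made for this
sketch and are not values given in the text.

```python
def dose_mg_per_kg(molar_conc, mol_weight_g_per_mol=6000.0,
                   litres_per_kg=1.0):
    """Convert a molar concentration to an approximate mg/kg dose.

    mol_weight_g_per_mol and litres_per_kg are illustrative
    assumptions, not values taken from the specification.
    """
    grams_per_litre = molar_conc * mol_weight_g_per_mol
    return grams_per_litre * 1000.0 * litres_per_kg


# An inhibiting concentration of 1x10^-7 M gives about 0.6 mg/kg
# under the assumptions above.
print(round(dose_mg_per_kg(1e-7), 3))
```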

The antisense can be introduced into the body as a bare
or naked oligonucleotide, oligonucleotide encapsulated in
lipid, oligonucleotide sequence encapsidated by viral
protein, or as oligonucleotide contained in an expression
vector such as those described in Example 25. The antisense
oligonucleotide is preferably introduced into the vertebrate
by injection. It is additionally contemplated that cells
from the vertebrate are removed, treated with the antisense
oligonucleotide, and reintroduced into the vertebrate. It is
further contemplated that the antisense oligonucleotide
sequence is incorporated into a ribozyme sequence to enable
the antisense to bind and cleave its target. For technical
applications of ribozyme and antisense oligonucleotides see
Rossi et al.

EXAMPLE 24 

Preparation and use of Triple Helix Probes 

Triple helix oligonucleotides are used to inhibit
transcription from a genome. They are particularly useful
for studying alterations in cell activity as it is associated
with a particular gene. The EST sequences or complete
sequences of the present invention or, more preferably, a
portion of those sequences, can be used to inhibit gene
expression in individuals having diseases associated with a
particular gene. Similarly, a portion of the EST or
corresponding gene sequence can be used to study the effect
of inhibiting transcription of a particular gene within a
cell. Traditionally, homopurine sequences were considered
the most useful. However, homopyrimidine sequences can also
inhibit gene expression. Thus, both types of sequences from
either the EST or from the gene corresponding to the EST are
contemplated within the scope of this invention.

Homopyrimidine oligonucleotides bind to the major groove at
homopurine:homopyrimidine sequences. As an example, 10-mer
to 20-mer homopyrimidine sequences from the ESTs can be used
to inhibit expression from homopurine sequences. SEQ ID NOs
such as 282, 888, 719, 670, 994, 240, 873 and 761 contain
homopyrimidine 15-mers. Moreover, the natural (beta) anomers
of the oligonucleotide units can be replaced with alpha
anomers to render the oligonucleotide more resistant to
nucleases. Further, an intercalating agent such as ethidium
bromide, or the like, can be attached to the 3' end of the
alpha oligonucleotide to stabilize the triple helix. For
information on the generation of oligonucleotides suitable
for triple helix formation see Griffin et al. (Science
245:967-971 (1989), which is hereby incorporated by this
reference).
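
Candidate homopyrimidine stretches such as the 15-mers
mentioned above can be located by scanning a sequence for runs
of C and T. The sketch below is one way to do so; the function
name and the test sequence are hypothetical, and the sequence
is not taken from any of the listed SEQ ID NOs.

```python
import re


def homopyrimidine_runs(seq, min_length=15):
    """Return (start, run) for each maximal stretch of C/T bases
    that is at least min_length long (start is 0-based)."""
    pattern = re.compile(r"[CT]{%d,}" % min_length)
    return [(m.start(), m.group()) for m in pattern.finditer(seq.upper())]


# Hypothetical sequence containing one 15-base homopyrimidine run.
seq = "GATTACA" + "CTCTTCCTCTTCCTC" + "GGA"
print(homopyrimidine_runs(seq))
```

Running the same scan on the reverse complement finds the
homopurine stretches of the given strand, since a homopurine
run on one strand is a homopyrimidine run on the other.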

The oligonucleotides may be prepared on an
oligonucleotide synthesizer or they may be purchased
commercially from a company specializing in custom
oligonucleotide synthesis. The sequences are introduced into
cells in culture using techniques known in the art that
include but are not limited to calcium phosphate
precipitation, DEAE-Dextran, electroporation,
liposome-mediated transfection or native uptake. Treated cells are
monitored for altered cell function. These cell functions
are predicted based upon the homologies of the gene,
corresponding to the EST from which the oligonucleotide was
derived, with known gene sequences that have been associated
with a particular function. The cell functions can also be
predicted based on the presence of abnormal physiologies
within cells derived from individuals with a particular
inherited disease, particularly when the EST is associated
with the disease using techniques described in Example 22.

EXAMPLE 25

Gene expression from DNA Sequences Corresponding to ESTs

A gene sequence of the present invention coding for all
or part of a human gene product is introduced into an
expression vector using conventional technology. (Techniques
to transfer cloned sequences into expression vectors that
direct protein translation in mammalian, yeast, insect or
bacterial expression systems are well known in the art.)
Commercially available vectors and expression systems are
available from a variety of suppliers including Stratagene
(La Jolla, California), Promega (Madison, Wisconsin), and
Invitrogen (San Diego, California). If desired, to enhance
expression and facilitate proper protein folding, the codon
context and codon pairing of the sequence may be optimized
for the particular expression organism, as explained by
Hatfield, et al., U.S. Patent No. 5,082,767, incorporated
herein by this reference.

The following is provided as one exemplary method to
generate polypeptide from cloned cDNA sequences. The cDNA
from the EST of interest is sequenced to identify the
methionine initiation codon for the gene and the poly A
sequence. If the cDNA lacks a poly A sequence, this sequence
can be added to the construct by, for example, splicing out
the poly A sequence from pSG5 (Stratagene) using BglI and
SalI restriction endonuclease enzymes and incorporating it
into the mammalian expression vector pXT1 (Stratagene). pXT1
contains the LTRs and a portion of the gag gene from Moloney
Murine Leukemia Virus. The position of the LTRs in the
construct allows efficient stable transfection. The vector
includes the Herpes Simplex Thymidine Kinase promoter and the
selectable neomycin gene. The cDNA is obtained by PCR from
the bacterial vector using oligonucleotide primers
complementary to the cDNA and containing restriction
endonuclease sequences for PstI incorporated into the
5' primer and BglII at the 5' end of the corresponding cDNA 3'
primer, taking care to ensure that the cDNA is positioned
in frame with the poly A sequence. The purified fragment
obtained from the resulting PCR reaction is digested with
PstI, blunt-ended with an exonuclease, digested with BglII,
purified and ligated to pXT1, now containing a poly A
sequence and digested with BglII.
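
The primer design described above, with a restriction site
added to the 5' end of each primer, can be sketched as follows.
The recognition sequences used are CTGCAG for PstI and AGATCT
for BglII. The 18-base annealing length and the toy cDNA are
assumptions made for illustration, and in practice a few extra
bases are often added 5' of the site to help the enzyme cut
near the end of the PCR product.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
PST_I_SITE = "CTGCAG"
BGL_II_SITE = "AGATCT"


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def tailed_primers(cdna, anneal_length=18):
    """Return (forward, reverse) primers, each written 5' to 3'.

    The forward primer carries a PstI site ahead of the first
    anneal_length bases of the cDNA; the reverse primer carries a
    BglII site ahead of the reverse complement of the last
    anneal_length bases.
    """
    cdna = cdna.upper()
    forward = PST_I_SITE + cdna[:anneal_length]
    reverse = BGL_II_SITE + reverse_complement(cdna[-anneal_length:])
    return forward, reverse


# Hypothetical 36-base coding sequence.
cdna = "ATG" + "GCT" * 10 + "TAA"
forward, reverse = tailed_primers(cdna)
print(forward)
print(reverse)
```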

The ligated product is transfected into mouse NIH 3T3
cells using Lipofectin (Life Technologies, Inc., Grand
Island, New York) under conditions outlined in the product
specification. Positive transfectants are selected after
growing the transfected cells in 600 µg/ml G418 (Sigma, St.
Louis, Missouri). The protein is preferably released into
the supernatant. However, if the protein has membrane binding
domains, the protein may additionally be retained within the
cell or expression may be restricted to the cell surface.

Since it may be necessary to purify and locate the
transfected product, synthetic 15-mer peptides synthesized
from the predicted cDNA sequence are injected into mice to
generate antibody to the polypeptide encoded by the cDNA.

If antibody production is not possible, the cDNA
sequence is additionally incorporated into eukaryotic
expression vectors and expressed as a chimeric with, for
example, β-globin. Antibody to β-globin is used to purify
the chimeric. Corresponding protease cleavage sites
engineered between the β-globin gene and the cDNA are then
used to separate the two polypeptide fragments from one
another after translation. One useful expression vector for
generating β-globin chimerics is pSG5 (Stratagene). This
vector encodes rabbit β-globin. Intron II of the rabbit
β-globin gene facilitates splicing of the expressed transcript,
and the polyadenylation signal incorporated into the
construct increases the level of expression. These
techniques as described are well known to those skilled in
the art of molecular biology. Standard methods are published
in methods texts such as Davis et al., and many of the methods
are available from the technical assistance representatives
from Stratagene, Life Technologies, Inc., or Promega.
Polypeptide may additionally be produced from either
construct using in vitro translation systems such as In vitro
Express™ Translation Kit (Stratagene).

Example 26 

Production of an Antibody to a Human Protein

Substantially pure protein or polypeptide is isolated
from the transfected or transformed cells as described in
Example 25. Concentration of protein in the final
preparation is adjusted, for example, by concentration on an
Amicon filter device, to the level of a few micrograms/ml.
Monoclonal or polyclonal antibody to the protein can then be
prepared as follows:

A. Monoclonal Antibody Production by Hybridoma Fusion 

Monoclonal antibody to epitopes of any of the peptides
identified and isolated as described can be prepared from
murine hybridomas according to the classical method of
Kohler, G. and Milstein, C., Nature 256:495 (1975) or
derivative methods thereof. Briefly, a mouse is repetitively
inoculated with a few micrograms of the selected protein over
a period of a few weeks. The mouse is then sacrificed, and
the antibody producing cells of the spleen isolated. The
spleen cells are fused by means of polyethylene glycol with
mouse myeloma cells, and the excess unfused cells destroyed
by growth of the system on selective media comprising
aminopterin (HAT media). The successfully fused cells are
diluted and aliquots of the dilution placed in wells of a
microtiter plate where growth of the culture is continued.
Antibody-producing clones are identified by detection of
antibody in the supernatant fluid of the wells by immunoassay
procedures, such as ELISA, as originally described by
Engvall, E., Meth. Enzymol. 70:419 (1980), and derivative
methods thereof. Selected positive clones can be expanded
and their monoclonal antibody product harvested for use.
Detailed procedures for monoclonal antibody production are
described in Davis, L. et al., Basic Methods in Molecular
Biology, Elsevier, New York, Section 21-2.

B. Polyclonal Antibody Production by Immunization 

Polyclonal antiserum containing antibodies to
heterogeneous epitopes of a single protein can be prepared by
immunizing suitable animals with the expressed protein
described above, which can be unmodified or modified to
enhance immunogenicity. Effective polyclonal antibody
production is affected by many factors related both to the
antigen and the host species. For example, small molecules
tend to be less immunogenic than others and may require the
use of carriers and adjuvant. Also, host animals vary in
response to site of inoculations and dose, with both
inadequate and excessive doses of antigen resulting in low
titer antisera. Small doses (ng level) of antigen
administered at multiple intradermal sites appear to be most
reliable. An effective immunization protocol for rabbits can
be found in Vaitukaitis, J. et al., J. Clin. Endocrinol.
Metab. 33:988-991 (1971).

Booster injections can be given at regular intervals,
and antiserum harvested when antibody titer thereof, as
determined semi-quantitatively, for example, by double
immunodiffusion in agar against known concentrations of the
antigen, begins to fall. See, for example, Ouchterlony, O.
et al., Chap. 19 in: Handbook of Experimental Immunology, D.
Weir (ed.), Blackwell (1973). Plateau concentration of
antibody is usually in the range of 0.1 to 0.2 mg/ml of serum
(about 12 µM). Affinity of the antisera for the antigen is
determined by preparing competitive binding curves, as
described, for example, by Fisher, D., Chap. 42 in: Manual of
Clinical Immunology, 2d Ed. (Rose and Friedman, eds.) Amer.
Soc. for Microbiol., Washington, D.C. (1980).
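
The harvest rule described above (collect antiserum when the titer begins to fall) can be sketched in Python as follows. This is only an illustrative reading of that rule: the function name, the tolerance parameter, and the example titers are hypothetical and not taken from any cited protocol.

```python
def harvest_index(titers, tolerance=0.0):
    """Return the index of the first bleed whose titer has fallen below the
    previous bleed by more than the given fractional tolerance, or None if
    the titer never falls. tolerance is an assumed guard against assay noise."""
    for i in range(1, len(titers)):
        if titers[i] < titers[i - 1] * (1.0 - tolerance):
            return i
    return None


if __name__ == "__main__":
    # Hypothetical semi-quantitative titers (mg/ml) from successive bleeds.
    bleeds = [0.02, 0.08, 0.15, 0.18, 0.17]
    print(harvest_index(bleeds))
```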

Antibody preparations prepared according to either
protocol are useful in quantitative immunoassays which
determine concentrations of antigen-bearing substances in
biological samples; they are also used semi-quantitatively or
qualitatively to identify the presence of antigen in a
biological sample.

EXAMPLE 27 

Identification of Tissue Types or Cell Species by Means of 
Labeled Tissue Specific Antibodies 

Identification of specific tissues is accomplished by
the visualization of tissue specific antigens by means of
antibody preparations according to Example 26 which are
conjugated, directly or indirectly, to a detectable marker.
Selected labeled antibody species bind to their specific
antigen binding partner in tissue sections, cell suspensions,
or in extracts of soluble proteins from a tissue sample to
provide a pattern for qualitative or semi-quantitative
interpretation.

Antisera for these procedures must have a potency
exceeding that of the native preparation, and for that
reason, antibodies are concentrated to a mg/ml level by
isolation of the gamma globulin fraction, for example, by
ion-exchange chromatography or by ammonium sulfate
fractionation. Also, to provide the most specific antisera,
unwanted antibodies, for example to common proteins, must be
removed from the gamma globulin fraction, for example by
means of insoluble immunoabsorbents, before the antibodies
are labeled with the marker. Either monoclonal or
heterologous antisera are suitable for either procedure.

A. Immunohistochemical Techniques

Purified, high-titer antibodies, prepared as described
above, are conjugated to a detectable marker, as described,
for example, by Fudenberg, H., Chap. 26 in: Basic & Clinical
Immunology, 3rd Ed. Lange, Los Altos, California (1980) or
Rose, N. et al., Chap. 12 in: Methods in Immunodiagnosis, 2d
Ed. John Wiley & Sons, New York (1980).

A fluorescent marker, either fluorescein or rhodamine,
is preferred, but antibodies can also be labeled with an
enzyme that supports a color producing reaction with a
substrate, such as horseradish peroxidase. Markers can be
added to tissue-bound antibody in a second step, as described
below. Alternatively, the specific anti-tissue antibodies can
be labeled with ferritin or other electron dense particles,
and localization of the ferritin coupled antigen-antibody
complexes achieved by means of an electron microscope. In
yet another approach, the antibodies are radiolabeled with,
for example, ¹²⁵I, and detected by overlaying the antibody
treated preparation with photographic emulsion.

Preparations to carry out the procedures can comprise
monoclonal or polyclonal antibodies to a single gene copy or
protein, identified as specific to a tissue type, for
example, brain tissue, or antibody preparations to several
antigenically distinct tissue specific antigens can be used
in panels, independently or in mixtures, as required.

Tissue sections and cell suspensions are prepared for
immunohistochemical examination according to common
histological techniques. Multiple cryostat sections (about
4 µm, unfixed) of the unknown tissue and known control are
mounted and each slide covered with different dilutions of
the antibody preparation. Sections of known and unknown
tissues should also be treated with preparations to provide
a positive control, a negative control, for example,
pre-immune sera, and a control for non-specific staining, for
example, buffer.

Treated sections are incubated in a humid chamber for
30 min at room temperature, rinsed, then washed in buffer for
30-45 min. Excess fluid is blotted away, and the marker
developed.

If the tissue specific antibody was not labeled in the
first incubation, it can be labeled at this time in a second
antibody-antibody reaction, for example, by adding
fluorescein- or enzyme-conjugated antibody against the
immunoglobulin class of the antiserum-producing species, for
example, fluorescein labeled antibody to mouse IgG. Such
labeled sera are commercially available.

The antigen found in the tissues by the above procedure
can be quantified by measuring the intensity of color or
fluorescence on the tissue section, and calibrating that
signal using appropriate standards.
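
One common way to calibrate a measured signal against standards is a linear standard curve. The Python sketch below fits a least-squares line through hypothetical (amount, signal) standards and inverts it for an unknown. The assumption that signal is linear in antigen amount over the working range, and all names and numbers, are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantify(signal, standards):
    """Estimate antigen amount from a measured signal.
    standards is a list of (known_amount, measured_signal) pairs."""
    amounts, signals = zip(*standards)
    slope, intercept = fit_line(amounts, signals)
    return (signal - intercept) / slope


if __name__ == "__main__":
    # Hypothetical standards: (antigen amount, fluorescence intensity).
    standards = [(0.0, 10.0), (1.0, 30.0), (2.0, 50.0), (4.0, 90.0)]
    print(quantify(70.0, standards))
```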

B. Identification of Tissue Specific Soluble Proteins 

The visualization of tissue specific proteins and
identification of unknown tissues from that procedure is
carried out using the labeled antibody reagents and detection
strategy as described for immunohistochemistry; however, the
sample is prepared according to an electrophoretic technique
to distribute the proteins extracted from the tissue in an
orderly array on the basis of molecular weight for detection.
A tissue sample is homogenized using a Virtis apparatus;
cell suspensions are disrupted by Dounce homogenization or
osmotic lysis, using detergents in either case as required to
disrupt cell membranes, as is the practice in the art.
Insoluble cell components such as nuclei, microsomes, and
membrane fragments are removed by ultracentrifugation, and
the soluble protein-containing fraction concentrated if
necessary and reserved for analysis.

A sample of the soluble protein solution is resolved
into individual protein species by conventional SDS
polyacrylamide electrophoresis as described, for example, by
Davis, L. et al., Section 19-2 in: Basic Methods in Molecular
Biology (P. Leder, ed.), Elsevier, New York (1986), using a
range of amounts of polyacrylamide in a set of gels to
resolve the entire molecular weight range of proteins to be
detected in the sample. A size marker is run in parallel for
purposes of estimating molecular weights of the constituent
proteins. Sample size for analysis is a convenient volume of
from 5 to 50 µl, containing from about 1 to 100 µg protein.
An aliquot of each of the resolved proteins is transferred by
blotting to a nitrocellulose filter paper, a process that
maintains the pattern of resolution. Multiple copies are
prepared. The procedure, known as Western Blot Analysis, is
well described in Davis, L. et al., (above) Section 19-3.
One set of nitrocellulose blots is stained with Coomassie
Blue dye to visualize the entire set of proteins for
comparison with the antibody bound proteins. The remaining
nitrocellulose filters are then incubated with a solution of
one or more specific antisera to tissue specific proteins
prepared as described in Example 26. In this procedure, as in
procedure A above, appropriate positive and negative sample
and reagent controls are run.
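
Estimating molecular weight from a size marker run in parallel is conventionally done by fitting log10(molecular weight) against relative mobility. The Python sketch below shows that calculation under the assumption of a linear semi-log relationship over the marker range; the function name and marker values are hypothetical.

```python
import math


def estimate_mw(rf, markers):
    """Estimate molecular weight (same units as the markers) of a band with
    relative mobility rf, given markers as (relative_mobility, mw) pairs.
    Assumes log10(mw) is linear in relative mobility."""
    xs = [m[0] for m in markers]
    ys = [math.log10(m[1]) for m in markers]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 10 ** (slope * rf + intercept)


if __name__ == "__main__":
    # Hypothetical marker ladder: (relative mobility, molecular weight in kDa).
    ladder = [(0.10, 116.0), (0.35, 66.0), (0.55, 45.0), (0.80, 29.0)]
    print(round(estimate_mw(0.45, ladder), 1))
```

A single gel percentage resolves only part of the molecular weight range, which is why the text calls for a set of gels; the fit should be made separately for each gel.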

In either procedure A or B, a detectable label can be
attached to the primary tissue antigen-primary antibody
complex according to various strategies and permutations
thereof. In a straightforward approach, the primary specific
antibody can be labeled; alternatively, the unlabeled complex
can be bound by a labeled secondary anti-IgG antibody. In
other approaches, either the primary or secondary antibody is
conjugated to a biotin molecule, which can, in a subsequent
step, bind an avidin conjugated marker. According to yet
another strategy, enzyme labeled or radioactive protein A,
which has the property of binding to any IgG, is bound in a
final step to either the primary or secondary antibody.

Visualization of tissue specific antigen binding to one
or more tissue specific antibodies, prepared from the gene
sequences identified from EST sequences, at levels above
those seen in control tissues can identify tissues of
unknown origin, for example, forensic samples, or
de-differentiated tumor tissue that has metastasized to foreign
bodily sites.
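
The comparison against control tissues can be sketched as a simple screen over an antibody panel. In the Python sketch below, the decision rule (signal greater than the highest control signal plus an optional margin), the function name, and the example values are all assumptions made for illustration; the text does not specify a threshold.

```python
def antibodies_above_control(sample, controls, margin=0.0):
    """Return the antibodies whose signal on the unknown sample exceeds
    every control-tissue signal by more than margin.
    sample:   {antibody_name: signal on the unknown tissue}
    controls: {antibody_name: [signals on control tissues]}"""
    hits = []
    for antibody, signal in sample.items():
        baseline = max(controls[antibody])
        if signal > baseline + margin:
            hits.append(antibody)
    return hits


if __name__ == "__main__":
    # Hypothetical panel readings in arbitrary intensity units.
    sample = {"anti-brain": 0.90, "anti-liver": 0.10}
    controls = {"anti-brain": [0.10, 0.20], "anti-liver": [0.10, 0.15]}
    print(antibodies_above_control(sample, controls))
```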

The entire contents of all references cited above are
hereby incorporated by reference.

While the present invention has been described in some
detail for purposes of clarity and understanding, one skilled
in the art will appreciate that various changes in form and
detail can be made without departing from the true scope of
the invention.

VII. Correlation of EST and Clone Identifiers

The EST sequences of the present invention are
identified herein by SEQ ID NO, in the GenBank database by a
different number, and in the inventors' lab (and upcoming
publications) by EST number; clones have been submitted to
the American Type Culture Collection (Rockville, Maryland
USA) under clone names. Table 12 cross references those
different numbers for the ESTs from cDNA, SEQ ID NOS 1-2409.

Certain Sequence ID NOS are excluded from some claims
based on their homology to known non-human sequences (See
Table 2).

NOTE REGARDING SEQUENCE LISTINGS: The listings of SEQ ID
NOS: 1-2421 are in numerical order. However, an occasional
number (for example, SEQ ID NO: 44) is not found in this
list. In all, 9 SEQ ID NOS are not used. Nevertheless, the
convention "1-2421" is used, for example, to refer to all the
SEQ ID NOS in the following list, while "1-315" is used, for
example, to refer to all the listed sequences falling between
SEQ ID NO 1 and SEQ ID NO 315.
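
The range convention can be made concrete with a short Python sketch that expands a range such as "1-315" into the SEQ ID NOS actually listed. The function name is hypothetical, and only SEQ ID NO: 44 is named in the note as unused, so the example set below is deliberately incomplete; the other eight unused numbers are not filled in.

```python
def listed_ids(convention, unused):
    """Expand a range such as "1-315" into the SEQ ID NOS that are actually
    listed, skipping any number in the unused set."""
    first, last = (int(part) for part in convention.split("-"))
    return [n for n in range(first, last + 1) if n not in unused]


if __name__ == "__main__":
    # Only SEQ ID NO: 44 is named in the note; the remaining unused
    # numbers are not given there and are intentionally left out.
    known_unused = {44}
    print(len(listed_ids("1-315", known_unused)))
    # Total count stated in the listing: 2421 numbers less 9 unused.
    print(2421 - 9)
```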
SEQUENCE LISTING

(1) GENERAL INFORMATION: 

(i) APPLICANT: Venter, J. Craig 
Adams, Mark D. 
Moreno, Ruben F. 

(ii) TITLE OF INVENTION: Sequences Characteristic of Human Gene
Transcription Product

(iii) NUMBER OF SEQUENCES: 2412 (1-2421, with 9 SEQ ID NOS unused.)

(iv) CORRESPONDENCE ADDRESS: 

(A) ADDRESSEE: Knobbe, Martens, Olson, and Bear

(B) STREET: 620 Newport Center Dr. Sixteenth Floor 

(C) CITY: Newport Beach 

(D) STATE: CA 

(E) COUNTRY: USA 

(F) ZIP: 92660 

(v) COMPUTER READABLE FORM: 

(A) MEDIUM TYPE: Floppy disk 

(B) COMPUTER: IBM PC compatible 

(C) OPERATING SYSTEM: PC-DOS/MS-DOS 

(D) SOFTWARE: PatentIn Release #1.0, Version #1.25

(vi) CURRENT APPLICATION DATA:

(A) APPLICATION NUMBER: 07/837,195 

(B) FILING DATE: 12-FEB-1992 

(vii) PRIOR APPLICATION DATA: 

(A) APPLICATION NUMBER: US 07/716,831 

(B) FILING DATE: 20-JUN-1991 

(viii) ATTORNEY/AGENT INFORMATION:

(A) NAME: Israelsen, Ned A. 

(B) REGISTRATION NUMBER: 29,655 

(C) REFERENCE/DOCKET NUMBER: NIH004.004CP1

(ix) TELECOMMUNICATION INFORMATION: 

(A) TELEPHONE: 619-235-8550 

(B) TELEFAX: 619-235-0176 

SEQ ID NO:1: (Length of Sequence = 362 Nucleotides)

CriXXCmT GTICCCCICA glXJlXXXTlT TAA3TGCITC CCTCCATTTT CCTTAGCAGC ATCCTAGTTG A3GCTCTGGG 
TTATCAC3tf3G AGCAAAAACA TITAAGTGTC AAATRAIGCT CAl'ICTCTCC CTQQGAT1TC TAAACAGAAA AAATGAAGAA
(acraCGGCG GOCTC3GTCAQ OCT CA TOOOC CICAIGCIM ACTIAATGAA AGAATQGWTC G0CX3OTOG AGAACTTOOC
GC3GCAAGGTG CIGCAGGGCA OGOQGQCTGC CIQCACT 

SEQ ID NO:1931: (Length of Sequence = 343 Nucleotides)

ATCACTTCCC CACCCCACAG GAICTGCCCC AGAGAAAGTC CTOGCTOGTC ACCAGCAAGC TTC0GQC7P3G CCAAGTITGAA 
TGATCCIGCC GGGGGCnCTC CEAGATOCTG AGAOGCTIOC CXHCCCTGOC CGACCOGGGT CCTGTGCTGG NTCCTGCCCC 

yixriucm ' tgcagocagg gctcaggagg tggctogggt ctgggctqga gaggcagaag eccmccrc ttoctgtccc 

ASCBCATQGA GCCCCTK3GG CTGAGCACCA AGACCTTGAA CCTTTTTTGT TmCCITIT TTOCAAAXAA CMJ1T1UGAG 
AAAIATCAAT GAAATCTGGG GGT 

SEQ ID NO:1932: (Length of Sequence = 314 Nucleotides)

VXT UAa Wtf i 1 TiTlWmU TTmrnCGA AEACIGAAAA AOTOCTTIGG GCrCICTGGG tiTiOXXACS CTCAOGGCTC 

citictocca cactcactgc ccnrcrrcoc acagcaaatc tatttcaagg acagtacitt ttaaaatgat taatottgag 

TICTCAACTA GCTCTGCAGA ACBY3AGGAG CICTTIGCAT CTCTCTGTGC GGATOGACTT TL'ITITATCT GACACCAGCT 
CTCX3\A0CAC ACIGATCCAA GGCA3TITAT CTACAGAGCT CAACTAGAAC CCCTITITCA TTAGGCXRCT OCAA 

SEQ ID NO:1933: (Length of Sequence = 378 Nucleotides)

AGCTTCCTGC GQGACCACAG CTATGTOACT GAAGCTGACA TCATCTCI3VC CCTTGAGTTC AACCACAGGG GAGAGCTGCT 
GGCCACAGGT GACAAGGGOG GOOGGGTOGT CMCnOCAG OQQGAACCAG AGAC7EAAAAA TGOQCOOCAC AGCCAGGGCG 
AATAOGAOCT GTACAGCACT TICCAGAGCC AGGAGCOGGA GTTTGACTAT CTCAAGAGCC TGGAGAIAGA GGAGAAOTC 
AACAAGATCA AGIQ3CTCCC ACAGCAGAAC GCOGCCCACT CACTCCICTT CCACCAAOGA TAAAACTATC AAATTATOGA 
AGA3TOOOGA AOGAGAXAA& AQQCCOGAAG GATACAACCT GAAGGATGAA GAGGGGAA 

SEQ ID NO:1934: (Length of Sequence = 239 Nucleotides)

ATTEAAATIG ACAGCCTTOC ATITrTOGEAG AAACTACAAA QGAACTGCT TTAGCACCCA TOGAGCCCCA AACGGCTAAG 
GTAAGCCAAG GTTTTAATGA CCAGCCCACT ATCIAAGCIT CCAAAOQGAT GOWXX3CAT CACAXACTXA CCCTGGGAGG 
CTGCIGCACG GGCATTCTCC YCTTGCTCAC GGCACITGGK GTAGOTITCA RGATOGCCTC TTTGAGGAAG GACTTCAGG 

SEQ ID NO:1935: (Length of Sequence = 319 Nucleotides)

TEAA 1T1T1T TOTCCCATAG AGGAATAGCA TTACMJICTPl ACAATCAGAA TICKnTACA CACATACACA GGCA3G0CAC 
AIGAGQCACT TGAGGTOCTT (HNTGCriGA GTCTGTIGAC ACCTCACATG CTCAAACTCT CCTCATITCA GCCAGTCTCA 
ACAGAAAACA CCCAACAGGG AIGCACICAA CriWlWlT CCATOIGGAA CXAGCTQGCA GGGOGAGAGG GAAACTAGTA 
GAAGGGGGCT ATGGTCTCTC TGCA3TCAGT CCCCTCACAT AAAGCCAGAX GGATCTAGGG GGQCTATCCA AGAGCICIG 

SEQ ID NO:1936: (Length of Sequence = 415 Nucleotides)

CrATTTTTAC AAATTA3AOC TAAIGACTAA AATTACTGTA AACTGAIAAC ATCCITCTAC CTGTA3TTCT AGTGACCCIT 
TAGOGGCAGG TAITEATACC TGGTATTEAT GATOCACTAT ATAACTGCTG AACAATAACT GACAGrTATIG TGCTIGCTGfr 
ACATCTCTOG TCTTTIGAAA CAG Al'lTlA G TAAGCATTIT CCAGAGCTAA AACTCICTCC TTATTCTAAT TTTATTOCTA 
GGGCAAAGTA GACAGGGATT A T1TULT1U A ATCTAnTCC AAATTAATAT TTITncnT GGTAnTCTA CACTTTAAGG 
OCAnTOGIG CAATTTAGAA AGTCTTGGCC TCCCITCCGC TAGGGACATT CAAAATTAAC TIOCAAAAOC TCAGGAACAG 
TACAAGGAAT TTGAA 
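
Because each entry in the listing declares its own length, an entry can be checked mechanically. The Python sketch below parses a header of the form used above, counts the residues in the body, and reports any characters outside the IUPAC nucleotide alphabet. The function name and the example entry are hypothetical; the example is not one of the listed sequences.

```python
import re

HEADER = re.compile(
    r"SEQ ID NO:\s*(\d+)\s*:\s*\(Length of Sequence = (\d+) Nucleotides\)"
)
# IUPAC nucleotide codes, including ambiguity codes.
IUPAC = set("ACGTRYKMSWBDHVN")


def check_entry(header, body_lines):
    """Compare the declared length of a listing entry with the number of
    residues present and collect any non-IUPAC characters."""
    match = HEADER.search(header)
    if match is None:
        raise ValueError("unrecognized header: " + header)
    seq_id, declared = int(match.group(1)), int(match.group(2))
    # Join the lines and drop the spaces that separate 10-base blocks.
    residues = "".join("".join(line.split()) for line in body_lines).upper()
    invalid = sorted({c for c in residues if c not in IUPAC})
    return {
        "seq_id": seq_id,
        "declared": declared,
        "observed": len(residues),
        "invalid": invalid,
    }


if __name__ == "__main__":
    # Hypothetical entry used only to demonstrate the check.
    report = check_entry(
        "SEQ ID NO:7: (Length of Sequence = 12 Nucleotides)",
        ["ACGTACGTNN AC"],
    )
    print(report)
```

A mismatch between the declared and observed lengths, or a non-empty list of invalid characters, would flag an entry for comparison against the deposited sequence.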



